An integrated package for supervised learning, covering over 150 kinds of models and a variety of evaluation metrics:
Applied Predictive Modeling
M. Kuhn and K. Johnson
Springer-Verlag, 2013.
ISBN: 978-1-4614-6848-6 (Print)
http://link.springer.com/book/10.1007%2F978-1-4614-6849-3
[APM] is similar to [ISL] and [ESL] but emphasizes practical model development/evaluation, including case histories with R scripts.
Its caret package (http://caret.r-forge.r-project.org) provides a uniform interface to the R packages that implement these models, supporting mainstream model evaluation methods and more than 150 popular models.
caret package manual (PDF): http://cran.r-project.org/web/packages/caret/caret.pdf
List of models in caret (reproduced as a table below): http://caret.r-forge.r-project.org/modelList.html
caret package overview: http://www.jstatsoft.org/v28/i05/paper
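A minimal caret workflow, before diving into the package documentation, looks like the following (a sketch, assuming caret and rpart are installed; iris ships with R):

```r
library(caret)

# 5-fold cross-validation as the resampling scheme
ctrl <- trainControl(method = "cv", number = 5)

# Fit a CART model (method = "rpart"), tuning cp over 3 candidate values
set.seed(1)
fit <- train(Species ~ ., data = iris,
             method = "rpart",
             tuneLength = 3,
             trControl = ctrl)

print(fit)                # resampled accuracy for each cp value
predict(fit, head(iris))  # class predictions from the final model
```

The same train()/trainControl() pattern recurs throughout the APM chapter scripts below; only the method string and tuning grid change.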
%load_ext rpy2.ipython
%%R
# TRUE if package pkg is not yet installed
not.installed <- function(pkg) !is.element(pkg, installed.packages()[,1])
%%R
if (not.installed("caret")) install.packages("caret")
library(caret)
library(help=caret)
Documentation for package 'caret'
Information on package 'caret'
Description:
Package: caret
Version: 6.0-41
Date: 2015-01-02
Title: Classification and Regression Training
Author: Max Kuhn. Contributions from Jed Wing, Steve
Weston, Andre Williams, Chris Keefer, Allan
Engelhardt, Tony Cooper, Zachary Mayer, Brenton
Kenkel, the R Core Team, Michael Benesty, Reynald
Lescarbeau, Andrew Ziem, and Luca Scrucca.
Description: Misc functions for training and plotting
classification and regression models
Maintainer: Max Kuhn <Max.Kuhn@pfizer.com>
Depends: R (>= 2.10), stats, lattice (>= 0.20), ggplot2
URL: http://caret.r-forge.r-project.org/
Imports: car, reshape2, foreach, methods, plyr, nlme,
BradleyTerry2
Suggests: e1071, earth (>= 2.2-3), fastICA, gam, ipred,
kernlab, klaR, MASS, ellipse, mda, mgcv, mlbench,
nnet, party (>= 0.9-99992), pls, pROC, proxy,
randomForest, RANN, spls, subselect, pamr, superpc,
Cubist, testthat (>= 0.9.1)
License: GPL (>= 2)
NeedsCompilation: yes
Packaged: 2015-01-02 18:40:43 UTC; kuhna03
Repository: CRAN
Date/Publication: 2015-01-03 06:58:41
Built: R 3.1.2; x86_64-apple-darwin13.4.0; 2015-01-04
06:10:33 UTC; unix
Index:
BloodBrain Blood Brain Barrier Data
BoxCoxTrans.default Box-Cox and Exponential Transformations
GermanCredit German Credit Data
as.table.confusionMatrix
Save Confusion Table Results
avNNet.default Neural Networks Using Model Averaging
bag.default A General Framework For Bagging
bagEarth Bagged Earth
bagFDA Bagged FDA
calibration Probability Calibration Plot
caretFuncs Backwards Feature Selection Helper Functions
caretSBF Selection By Filtering (SBF) Helper Functions
cars Kelly Blue Book resale data for 2005 model year
GM cars
classDist Compute and predict the distances to class
centroids
confusionMatrix Create a confusion matrix
confusionMatrix.train Estimate a Resampled Confusion Matrix
cox2 COX-2 Activity Data
createDataPartition Data Splitting functions
dhfr Dihydrofolate Reductase Inhibitors Data
diff.resamples Inferential Assessments About Model Performance
dotPlot Create a dotplot of variable importance values
dotplot.diff.resamples
Lattice Functions for Visualizing Resampling
Differences
downSample Down- and Up-Sampling Imbalanced Data
dummyVars Create A Full Set of Dummy Variables
featurePlot Wrapper for Lattice Plotting of Predictor
Variables
filterVarImp Calculation of filter-based variable importance
findCorrelation Determine highly correlated variables
findLinearCombos Determine linear combinations in a matrix
format.bagEarth Format 'bagEarth' objects
gafs.default Genetic algorithm feature selection
gafs_initial Ancillary genetic algorithm functions
histogram.train Lattice functions for plotting resampling
results
icr.formula Independent Component Regression
index2vec Convert indicies to a binary vector
knn3 k-Nearest Neighbour Classification
knnreg k-Nearest Neighbour Regression
lift Lift Plot
maxDissim Maximum Dissimilarity Sampling
mdrr Multidrug Resistance Reversal (MDRR) Agent Data
modelLookup Tools for Models Available in 'train'
nearZeroVar Identification of near zero variance predictors
nullModel Fit a simple, non-informative model
oil Fatty acid composition of commercial oils
oneSE Selecting tuning Parameters
panel.lift2 Lattice Panel Functions for Lift Plots
panel.needle Needle Plot Lattice Panel
pcaNNet.default Neural Networks with a Principal Component Step
plot.gafs Plot Method for the gafs and safs Classes
plot.rfe Plot RFE Performance Profiles
plot.train Plot Method for the train Class
plot.varImp.train Plotting variable importance measures
plotClassProbs Plot Predicted Probabilities in Classification
Models
plotObsVsPred Plot Observed versus Predicted Results in
Regression and Classification Models
plsda Partial Least Squares and Sparse Partial Least
Squares Discriminant Analysis
postResample Calculates performance across resamples
pottery Pottery from Pre-Classical Sites in Italy
prcomp.resamples Principal Components Analysis of Resampling
Results
preProcess Pre-Processing of Predictors
predict.bagEarth Predicted values based on bagged Earth and FDA
models
predict.gafs Predict new samples
predict.knn3 Predictions from k-Nearest Neighbors
predict.knnreg Predictions from k-Nearest Neighbors Regression
Model
predict.train Extract predictions and class probabilities
from train objects
predictors List predictors used in the model
print.confusionMatrix Print method for confusionMatrix
print.train Print Method for the train Class
resampleHist Plot the resampling distribution of the model
statistics
resampleSummary Summary of resampled performance estimates
resamples Collation and Visualization of Resampling
Results
rfe Backwards Feature Selection
rfeControl Controlling the Feature Selection Algorithms
safs.default Simulated annealing feature selection
safsControl Control parameters for GA and SA feature
selection
safs_initial Ancillary simulated annealing functions
sbf Selection By Filtering (SBF)
sbfControl Control Object for Selection By Filtering (SBF)
segmentationData Cell Body Segmentation
sensitivity Calculate sensitivity, specificity and
predictive values
spatialSign Compute the multivariate spatial sign
summary.bagEarth Summarize a bagged earth or FDA fit
tecator Fat, Water and Protein Content of Meat Samples
train Fit Predictive Models over Different Tuning
Parameters
trainControl Control parameters for train
train_model_list A List of Available Models in train
twoClassSim Simulation Functions
update.safs Update or Re-fit a SA or GA Model
update.train Update or Re-fit a Model
varImp Calculation of variable importance for
regression and classification models
varImp.gafs Variable importances for GAs and SAs
xyplot.resamples Lattice Functions for Visualizing Resampling
Results
xyplot.rfe Lattice functions for plotting resampling
results of recursive feature selection
Further information is available in the following vignettes in
directory
'/Library/Frameworks/R.framework/Versions/3.1/Resources/library/caret/doc':
caret: A Short Introduction to the caret Package (source, pdf)
Loading required package: lattice
Loading required package: ggplot2
[APM] includes examples using the following packages/models:
C5.0, J48, M5, Nelder-Mead, PART, avNNet, cforest, ctree, cubist, earth, enet, fda, gbm, glm, glmnet, knn, lda, lm, mda, nb, nnet, pam, pcr, pls, rf, ridge, rpart, sparseLDA, svmPoly, svmRadial, treebag
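caret's modelLookup() maps each of these method codes to its tuning parameters, which is handy when reading the chapter scripts (a sketch, assuming caret is loaded):

```r
library(caret)

# One row per tuning parameter of the given method
modelLookup("rpart")      # cp
modelLookup("svmRadial")  # sigma, C

# With no argument, the full table: one row per (method, parameter) pair
head(modelLookup())
```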
Installing and getting R example scripts:
install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
getPackages(1:19) # download ALL packages used in chs 1-19, including caret
%%R
if (not.installed("AppliedPredictiveModeling")) {
install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
for (chapter in c(2,3,4,6,7,8,10, 11,12,13,14,16,17,19)) getPackages(chapter)
} else {
library(AppliedPredictiveModeling)
}
library(help=AppliedPredictiveModeling)
Documentation for package 'AppliedPredictiveModeling'
Information on package 'AppliedPredictiveModeling'
Description:
Package: AppliedPredictiveModeling
Type: Package
Title: Functions and Data Sets for 'Applied Predictive
Modeling'
Version: 1.1-6
Date: 2014-07-24
Author: Max Kuhn, Kjell Johnson
Maintainer: Max Kuhn <mxkuhn@gmail.com>
Description: A few functions and several data set for the
Springer book 'Applied Predictive Modeling'
URL: http://appliedpredictivemodeling.com/
Depends: R (>= 2.10)
Imports: CORElearn, MASS, plyr, reshape2
Suggests: caret (>= 6.0-22), lattice, ellipse
License: GPL
Packaged: 2014-07-25 13:37:54 UTC; kuhna03
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2014-07-25 18:42:22
Built: R 3.1.2; ; 2015-01-10 01:37:59 UTC; unix
Index:
AppliedPredictiveModeling-package
Data, Functions and Scripts for
'scriptLocation'
ChemicalManufacturingProcess
Chemical Manufacturing Process Data
abalone Abalone Data
bio Hepatic Injury Data
bookTheme Lattice Themes
cars2010 Fuel Economy Data
concrete Compressive Strength of Concrete from Yeh
(1998)
diagnosis Alzheimer's Disease CSF Data
getPackages Install Packages for Each Chapter
logisticCreditPredictions
Logistic Regression Predictions for the Credit
Data
permeability Permeability Data
permuteRelief Permutation Statistics for the Relief Algorithm
quadBoundaryFunc Functions for Simulating Data
schedulingData HPC Job Scheduling Data
scriptLocation Find Chapter Script Files
segmentationOriginal Cell Body Segmentation
trainX Solubility Data
twoClassData Two Class Example Data
%%R
# Grid Search is often used in APM to search a model's parameter space, and
# some chapters use the "doMC" package to do Multi-Core computation
# (supported only on Linux or MacOS):
if (not.installed("doMC")) install.packages("doMC") # multicore computation in R
library(doMC)
library(help=doMC)
Documentation for package 'doMC'
Information on package 'doMC'
Description:
Package: doMC
Type: Package
Title: Foreach parallel adaptor for the
multicore package
Version: 1.3.3
Author: Revolution Analytics
Maintainer: Revolution Analytics
<packages@revolutionanalytics.com>
Description: Provides a parallel backend for the
%dopar% function using the
multicore functionality of the
parallel package..
Depends: R (>= 2.14.0), foreach(>= 1.2.0),
iterators(>= 1.0.0), parallel
Imports: utils
Enhances: compiler, RUnit
License: GPL-2
Repository: CRAN
Repository/R-Forge/Project: domc
Repository/R-Forge/Revision: 16
Repository/R-Forge/DateTimeStamp: 2014-02-25 19:29:46
Date/Publication: 2014-02-28 07:00:48
Packaged: 2014-02-25 23:42:04 UTC; rforge
NeedsCompilation: no
OS_type: unix
Built: R 3.1.2; ; 2015-01-09 23:21:34 UTC;
unix
Index:
doMC-package The doMC Package
registerDoMC registerDoMC
Further information is available in the following vignettes in
directory
'/Library/Frameworks/R.framework/Versions/3.1/Resources/library/doMC/doc':
gettingstartedMC: Getting Started with doMC and foreach (source, pdf)
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Loading required package: parallel
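Registering the backend is a single call; afterwards train() runs its resampling loops in parallel via foreach's %dopar% (a sketch for a unix machine with doMC installed):

```r
library(doMC)

# Register 2 worker processes for foreach/%dopar%;
# caret's train() picks up the registered backend automatically
registerDoMC(cores = 2)

# Confirm the backend is active
foreach::getDoParWorkers()  # 2
```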
import pandas as pd
pd.set_option('display.max_rows', 500)
ModelDF = pd.read_table("AppliedPredictiveModelingModelTable.tsv")
print(ModelDF.describe())
# print(ModelDF.columns)
#
# colnames = list(ModelDF)
# print(colnames)
#
# print(ModelDF.iloc[:, 0:2]) # print first two columns
ModelDF
Model method Argument Value Type Packages \
count 180 180 180 176
unique 159 180 3 88
top Partial Least Squares logicBag Classification kernlab
freq 4 1 72 17
Tuning Parameters
count 180
unique 115
top None
freq 24
| | Model | method | Type | Packages | Tuning Parameters |
|---|---|---|---|---|---|
| 0 | Boosted Classification Trees | ada | Classification | ada, plyr | iter, maxdepth, nu |
| 1 | Bagged AdaBoost | AdaBag | Classification | adabag, plyr | mfinal, maxdepth |
| 2 | AdaBoost.M1 | AdaBoost.M1 | Classification | adabag, plyr | mfinal, maxdepth, coeflearn |
| 3 | Adaptive Mixture Discriminant Analysis | amdai | Classification | adaptDA | model |
| 4 | Adaptive-Network-Based Fuzzy Inference System | ANFIS | Regression | frbs | num.labels, max.iter |
| 5 | Model Averaged Neural Network | avNNet | Dual Use | nnet | size, decay, bag |
| 6 | Bagged Model | bag | Dual Use | caret | vars |
| 7 | Bagged MARS | bagEarth | Dual Use | earth | nprune, degree |
| 8 | Bagged MARS using gCV Pruning | bagEarthGCV | Dual Use | earth | degree |
| 9 | Bagged Flexible Discriminant Analysis | bagFDA | Classification | earth, mda | degree, nprune |
| 10 | Bagged FDA using gCV Pruning | bagFDAGCV | Classification | earth | degree |
| 11 | Bayesian Generalized Linear Model | bayesglm | Dual Use | arm | None |
| 12 | Self-Organizing Map | bdk | Dual Use | kohonen | xdim, ydim, xweight, topo |
| 13 | Binary Discriminant Analysis | binda | Classification | binda | lambda.freqs |
| 14 | Boosted Tree | blackboost | Dual Use | party, mboost, plyr | mstop, maxdepth |
| 15 | Random Forest with Additional Feature Selection | Boruta | Dual Use | Boruta, randomForest | mtry |
| 16 | Bayesian Regularized Neural Networks | brnn | Regression | brnn | neurons |
| 17 | Boosted Linear Model | bstLs | Dual Use | bst, plyr | mstop, nu |
| 18 | Boosted Smoothing Spline | bstSm | Dual Use | bst, plyr | mstop, nu |
| 19 | Boosted Tree | bstTree | Dual Use | bst, plyr | mstop, maxdepth, nu |
| 20 | C5.0 | C5.0 | Classification | C50, plyr | trials, model, winnow |
| 21 | Cost-Sensitive C5.0 | C5.0Cost | Classification | C50, plyr | trials, model, winnow, cost |
| 22 | Single C5.0 Ruleset | C5.0Rules | Classification | C50 | None |
| 23 | Single C5.0 Tree | C5.0Tree | Classification | C50 | None |
| 24 | Conditional Inference Random Forest | cforest | Dual Use | party | mtry |
| 25 | SIMCA | CSimca | Classification | rrcovHD | None |
| 26 | Conditional Inference Tree | ctree | Dual Use | party | mincriterion |
| 27 | Conditional Inference Tree | ctree2 | Dual Use | party | maxdepth |
| 28 | Cubist | cubist | Regression | Cubist | committees, neighbors |
| 29 | Dynamic Evolving Neural-Fuzzy Inference System | DENFIS | Regression | frbs | Dthr, max.iter |
| 30 | Stacked AutoEncoder Deep Neural Network | dnn | Dual Use | deepnet | layer1, layer2, layer3, hidden_dropout, visibl... |
| 31 | Multivariate Adaptive Regression Spline | earth | Dual Use | earth | nprune, degree |
| 32 | Extreme Learning Machine | elm | Dual Use | elmNN | nhid, actfun |
| 33 | Elasticnet | enet | Regression | elasticnet | fraction, lambda |
| 34 | Ensemble Partial Least Squares Regression with... | enpls.fs | Regression | enpls | maxcomp, threshold |
| 35 | Ensemble Partial Least Squares Regression | enpls | Regression | enpls | maxcomp |
| 36 | Tree Models from Genetic Algorithms | evtree | Dual Use | evtree | alpha |
| 37 | Random Forest by Randomization | extraTrees | Dual Use | extraTrees | mtry, numRandomCuts |
| 38 | Flexible Discriminant Analysis | fda | Classification | earth, mda | degree, nprune |
| 39 | Fuzzy Rules Using Genetic Cooperative-Competit... | FH.GBML | Classification | frbs | max.num.rule, popu.size, max.gen |
| 40 | Fuzzy Inference Rules by Descent Method | FIR.DM | Regression | frbs | num.labels, max.iter |
| 41 | Ridge Regression with Variable Selection | foba | Regression | foba | k, lambda |
| 42 | Fuzzy Rules Using Chi's Method | FRBCS.CHI | Classification | frbs | num.labels, type.mf |
| 43 | Fuzzy Rules with Weight Factor | FRBCS.W | Classification | frbs | num.labels, type.mf |
| 44 | Simplified TSK Fuzzy Rules | FS.HGD | Regression | frbs | num.labels, max.iter |
| 45 | Generalized Additive Model using Splines | gam | Dual Use | mgcv | select, method |
| 46 | Boosted Generalized Additive Model | gamboost | Dual Use | mboost | mstop, prune |
| 47 | Generalized Additive Model using LOESS | gamLoess | Dual Use | gam | span, degree |
| 48 | Generalized Additive Model using Splines | gamSpline | Dual Use | gam | df |
| 49 | Gaussian Process | gaussprLinear | Dual Use | kernlab | None |
| 50 | Gaussian Process with Polynomial Kernel | gaussprPoly | Dual Use | kernlab | degree, scale |
| 51 | Gaussian Process with Radial Basis Function Ke... | gaussprRadial | Dual Use | kernlab | sigma |
| 52 | Stochastic Gradient Boosting | gbm | Dual Use | gbm, plyr | n.trees, interaction.depth, shrinkage |
| 53 | Multivariate Adaptive Regression Splines | gcvEarth | Dual Use | earth | degree |
| 54 | Fuzzy Rules via MOGUL | GFS.FR.MOGAL | Regression | frbs | max.gen, max.iter, max.tune |
| 55 | Fuzzy Rules Using Genetic Cooperative-Competit... | GFS.GCCL | Classification | frbs | num.labels, popu.size, max.gen |
| 56 | Genetic Lateral Tuning and Rule Selection of L... | GFS.LT.RS | Regression | frbs | popu.size, num.labels, max.gen |
| 57 | Fuzzy Rules via Thrift | GFS.THRIFT | Regression | frbs | popu.size, num.labels, max.gen |
| 58 | Generalized Linear Model | glm | Dual Use | NaN | None |
| 59 | Boosted Generalized Linear Model | glmboost | Dual Use | mboost | mstop, prune |
| 60 | glmnet | glmnet | Dual Use | glmnet | alpha, lambda |
| 61 | Generalized Linear Model with Stepwise Feature... | glmStepAIC | Dual Use | MASS | None |
| 62 | Generalized Partial Least Squares | gpls | Classification | gpls | K.prov |
| 63 | Heteroscedastic Discriminant Analysis | hda | Classification | hda | gamma, lambda, newdim |
| 64 | High Dimensional Discriminant Analysis | hdda | Classification | HDclassif | threshold, model |
| 65 | Hybrid Neural Fuzzy Inference System | HYFIS | Regression | frbs | num.labels, max.iter |
| 66 | Independent Component Regression | icr | Regression | fastICA | n.comp |
| 67 | C4.5-like Trees | J48 | Classification | RWeka | C |
| 68 | Rule-Based Classifier | JRip | Classification | RWeka | NumOpt |
| 69 | Partial Least Squares | kernelpls | Dual Use | pls | ncomp |
| 70 | k-Nearest Neighbors | kknn | Dual Use | kknn | kmax, distance, kernel |
| 71 | k-Nearest Neighbors | knn | Dual Use | NaN | k |
| 72 | Polynomial Kernel Regularized Least Squares | krlsPoly | Regression | KRLS | lambda, degree |
| 73 | Radial Basis Function Kernel Regularized Least... | krlsRadial | Regression | KRLS, kernlab | lambda, sigma |
| 74 | Least Angle Regression | lars | Regression | lars | fraction |
| 75 | Least Angle Regression | lars2 | Regression | lars | step |
| 76 | The lasso | lasso | Regression | elasticnet | fraction |
| 77 | Linear Discriminant Analysis | lda | Classification | MASS | None |
| 78 | Linear Discriminant Analysis | lda2 | Classification | MASS | dimen |
| 79 | Linear Regression with Backwards Selection | leapBackward | Regression | leaps | nvmax |
| 80 | Linear Regression with Forward Selection | leapForward | Regression | leaps | nvmax |
| 81 | Linear Regression with Stepwise Selection | leapSeq | Regression | leaps | nvmax |
| 82 | Robust Linear Discriminant Analysis | Linda | Classification | rrcov | None |
| 83 | Linear Regression | lm | Regression | NaN | None |
| 84 | Linear Regression with Stepwise Selection | lmStepAIC | Regression | MASS | None |
| 85 | Logistic Model Trees | LMT | Classification | RWeka | iter |
| 86 | Bagged Logic Regression | logicBag | Dual Use | logicFS | nleaves, ntrees |
| 87 | Boosted Logistic Regression | LogitBoost | Classification | caTools | nIter |
| 88 | Logic Regression | logreg | Dual Use | LogicReg | treesize, ntrees |
| 89 | Least Squares Support Vector Machine | lssvmLinear | Classification | kernlab | None |
| 90 | Least Squares Support Vector Machine with Poly... | lssvmPoly | Classification | kernlab | degree, scale |
| 91 | Least Squares Support Vector Machine with Radi... | lssvmRadial | Classification | kernlab | sigma |
| 92 | Learning Vector Quantization | lvq | Classification | class | size, k |
| 93 | Model Tree | M5 | Regression | RWeka | pruned, smoothed, rules |
| 94 | Model Rules | M5Rules | Regression | RWeka | pruned, smoothed |
| 95 | Mixture Discriminant Analysis | mda | Classification | mda | subclasses |
| 96 | Maximum Uncertainty Linear Discriminant Analysis | Mlda | Classification | HiDimDA | None |
| 97 | Multi-Layer Perceptron | mlp | Dual Use | RSNNS | size |
| 98 | Multi-Layer Perceptron | mlpWeightDecay | Dual Use | RSNNS | size, decay |
| 99 | Penalized Multinomial Regression | multinom | Classification | nnet | decay |
| 100 | Naive Bayes | nb | Classification | klaR | fL, usekernel |
| 101 | Neural Network | neuralnet | Regression | neuralnet | layer1, layer2, layer3 |
| 102 | Neural Network | nnet | Dual Use | nnet | size, decay |
| 103 | Tree-Based Ensembles | nodeHarvest | Dual Use | nodeHarvest | maxinter, mode |
| 104 | Oblique Trees | oblique.tree | Classification | oblique.tree | oblique.splits, variable.selection |
| 105 | Single Rule Classification | OneR | Classification | RWeka | None |
| 106 | Oblique Random Forest | ORFlog | Classification | obliqueRF | mtry |
| 107 | Oblique Random Forest | ORFpls | Classification | obliqueRF | mtry |
| 108 | Oblique Random Forest | ORFridge | Classification | obliqueRF | mtry |
| 109 | Oblique Random Forest | ORFsvm | Classification | obliqueRF | mtry |
| 110 | Nearest Shrunken Centroids | pam | Classification | pamr | threshold |
| 111 | Parallel Random Forest | parRF | Dual Use | randomForest | mtry |
| 112 | Rule-Based Classifier | PART | Classification | RWeka | threshold, pruned |
| 113 | partDSA | partDSA | Dual Use | partDSA | cut.off.growth, MPD |
| 114 | Neural Networks with Feature Extraction | pcaNNet | Dual Use | nnet | size, decay |
| 115 | Principal Component Analysis | pcr | Regression | pls | ncomp |
| 116 | Penalized Discriminant Analysis | pda | Classification | mda | lambda |
| 117 | Penalized Discriminant Analysis | pda2 | Classification | mda | df |
| 118 | Penalized Linear Regression | penalized | Regression | penalized | lambda1, lambda2 |
| 119 | Penalized Linear Discriminant Analysis | PenalizedLDA | Classification | penalizedLDA, plyr | lambda, K |
| 120 | Penalized Logistic Regression | plr | Classification | stepPlr | lambda, cp |
| 121 | Partial Least Squares | pls | Dual Use | pls | ncomp |
| 122 | Partial Least Squares Generalized Linear Models | plsRglm | Dual Use | plsRglm | nt, alpha.pvals.expli |
| 123 | Ordered Logistic or Probit Regression | polr | Classification | MASS | None |
| 124 | Projection Pursuit Regression | ppr | Regression | NaN | nterms |
| 125 | Greedy Prototype Selection | protoclass | Classification | proxy, protoclass | eps, Minkowski |
| 126 | Quadratic Discriminant Analysis | qda | Classification | MASS | None |
| 127 | Robust Quadratic Discriminant Analysis | QdaCov | Classification | rrcov | None |
| 128 | Quantile Random Forest | qrf | Regression | quantregForest | mtry |
| 129 | Quantile Regression Neural Network | qrnn | Regression | qrnn | n.hidden, penalty, bag |
| 130 | Radial Basis Function Network | rbf | Classification | RSNNS | size |
| 131 | Radial Basis Function Network | rbfDDA | Dual Use | RSNNS | negativeThreshold |
| 132 | Regularized Discriminant Analysis | rda | Classification | klaR | gamma, lambda |
| 133 | Relaxed Lasso | relaxo | Regression | relaxo, plyr | lambda, phi |
| 134 | Random Forest | rf | Dual Use | randomForest | mtry |
| 135 | Random Ferns | rFerns | Classification | rFerns | depth |
| 136 | Factor-Based Linear Discriminant Analysis | RFlda | Classification | HiDimDA | q |
| 137 | Ridge Regression | ridge | Regression | elasticnet | lambda |
| 138 | Random k-Nearest Neighbors | rknn | Dual Use | rknn | k, mtry |
| 139 | Random k-Nearest Neighbors with Feature Selection | rknnBel | Dual Use | rknn, plyr | k, mtry, d |
| 140 | Robust Linear Model | rlm | Regression | MASS | None |
| 141 | Robust Mixture Discriminant Analysis | rmda | Classification | robustDA | K, model |
| 142 | ROC-Based Classifier | rocc | Classification | rocc | xgenes |
| 143 | CART | rpart | Dual Use | rpart | cp |
| 144 | CART | rpart2 | Dual Use | rpart | maxdepth |
| 145 | Cost-Sensitive CART | rpartCost | Classification | rpart | cp, Cost |
| 146 | Regularized Random Forest | RRF | Dual Use | randomForest, RRF | mtry, coefReg, coefImp |
| 147 | Regularized Random Forest | RRFglobal | Dual Use | RRF | mtry, coefReg |
| 148 | Robust Regularized Linear Discriminant Analysis | rrlda | Classification | rrlda | lambda, hp, penalty |
| 149 | Robust SIMCA | RSimca | Classification | rrcovHD | None |
| 150 | Relevance Vector Machines with Linear Kernel | rvmLinear | Regression | kernlab | None |
| 151 | Relevance Vector Machines with Polynomial Kernel | rvmPoly | Regression | kernlab | scale, degree |
| 152 | Relevance Vector Machines with Radial Basis Fu... | rvmRadial | Regression | kernlab | sigma |
| 153 | Subtractive Clustering and Fuzzy c-Means Rules | SBC | Regression | frbs | r.a, eps.high, eps.low |
| 154 | Shrinkage Discriminant Analysis | sda | Classification | sda | diagonal, lambda |
| 155 | Stepwise Diagonal Linear Discriminant Analysis | sddaLDA | Classification | SDDA | None |
| 156 | Stepwise Diagonal Quadratic Discriminant Analysis | sddaQDA | Classification | SDDA | None |
| 157 | Partial Least Squares | simpls | Dual Use | pls | ncomp |
| 158 | Fuzzy Rules Using the Structural Learning Algo... | SLAVE | Classification | frbs | num.labels, max.iter, max.gen |
| 159 | Stabilized Linear Discriminant Analysis | slda | Classification | ipred | None |
| 160 | Sparse Mixture Discriminant Analysis | smda | Classification | sparseLDA | NumVars, lambda, R |
| 161 | Sparse Linear Discriminant Analysis | sparseLDA | Classification | sparseLDA | NumVars, lambda |
| 162 | Sparse Partial Least Squares | spls | Dual Use | spls | K, eta, kappa |
| 163 | Linear Discriminant Analysis with Stepwise Fea... | stepLDA | Classification | klaR, MASS | maxvar, direction |
| 164 | Quadratic Discriminant Analysis with Stepwise ... | stepQDA | Classification | klaR, MASS | maxvar, direction |
| 165 | Supervised Principal Component Analysis | superpc | Regression | superpc | threshold, n.components |
| 166 | Support Vector Machines with Boundrange String... | svmBoundrangeString | Dual Use | kernlab | length, C |
| 167 | Support Vector Machines with Exponential Strin... | svmExpoString | Dual Use | kernlab | lambda, C |
| 168 | Support Vector Machines with Linear Kernel | svmLinear | Dual Use | kernlab | C |
| 169 | Support Vector Machines with Polynomial Kernel | svmPoly | Dual Use | kernlab | degree, scale, C |
| 170 | Support Vector Machines with Radial Basis Func... | svmRadial | Dual Use | kernlab | sigma, C |
| 171 | Support Vector Machines with Radial Basis Func... | svmRadialCost | Dual Use | kernlab | C |
| 172 | Support Vector Machines with Class Weights | svmRadialWeights | Classification | kernlab | sigma, C, Weight |
| 173 | Support Vector Machines with Spectrum String K... | svmSpectrumString | Dual Use | kernlab | length, C |
| 174 | Bagged CART | treebag | Dual Use | ipred, plyr | None |
| 175 | Variational Bayesian Multinomial Probit Regres... | vbmpRadial | Classification | vbmp | estimateTheta |
| 176 | Partial Least Squares | widekernelpls | Dual Use | pls | ncomp |
| 177 | Wang and Mendel Fuzzy Rules | WM | Regression | frbs | num.labels, type.mf |
| 178 | Weighted Subspace Random Forest | wsrf | Classification | wsrf | mtry |
| 179 | Self-Organizing Maps | xyf | Dual Use | kohonen | xdim, ydim, xweight, topo |
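With the table in a DataFrame, pandas makes it easy to slice by task type or by implementing package; a sketch with a few rows from the table above inlined (column names as in ModelDF):

```python
import pandas as pd

# A few rows of the caret model table, inlined for illustration
models = pd.DataFrame({
    "Model":    ["Random Forest", "Cubist", "Linear Discriminant Analysis", "glmnet"],
    "method":   ["rf", "cubist", "lda", "glmnet"],
    "Type":     ["Dual Use", "Regression", "Classification", "Dual Use"],
    "Packages": ["randomForest", "Cubist", "MASS", "glmnet"],
})

# Methods usable for classification (Type is Classification or Dual Use)
clf = models[models["Type"].isin(["Classification", "Dual Use"])]
print(clf["method"].tolist())      # ['rf', 'lda', 'glmnet']

# Count methods per task type
print(models["Type"].value_counts())
```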
Chapters use the following models:
02_A_Short_Tour.R lm, earth
04_Over_Fitting.R svmRadial, glm
06_Linear_Regression.R lm, pls, pcr, ridge, enet
07_Non-Linear_Reg.R avNNet, earth, svmRadial, svmPoly, knn
08_Regression_Trees.R rpart, ctree, M5, treebag, rf, cforest, gbm
10_Case_Study_Concrete.R lm, pls, enet, earth, svmRadial, avNNet, rpart,
treebag, ctree, rf, gbm, cubist, M5, Nelder-Mead
11_Class_Performance.R glm
12_Discriminant_Analysis.R svmRadial, glm, lda, pls, glmnet, pam
13_Non-Linear_Class.R mda, nnet, avNNet, fda, svmRadial, svmPoly, knn, nb
14_Class_Trees.R rpart, J48, PART, treebag, rf, gbm, C5.0
16_Class_Imbalance.R rf, glm, fda, svmRadial, rpart, C5.0
17_Job_Scheduling.R rpart, lda, sparseLDA, nnet, pls, fda, rf, C5.0,
treebag, svmRadial
19_Feature_Select.R rf, lda, svmRadial, nb, glm, knn, svmRadial, knn
Training control methods used by the scripts:
04_Over_Fitting.R repeatedcv, cv, LOOCV, LGOCV, boot, boot632
06_Linear_Regression.R cv
07_Non-Linear_Reg.R cv
08_Regression_Trees.R cv, oob
10_Case_Study_Concrete.R repeatedcv
11_Class_Performance.R repeatedcv
12_Discriminant_Analysis.R cv, LGOCV
13_Non-Linear_Class.R LGOCV
14_Class_Trees.R LGOCV
16_Class_Imbalance.R cv
17_Job_Scheduling.R repeatedcv
19_Feature_Select.R repeatedcv, cv
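Each of these resampling schemes is selected through trainControl's method argument (a sketch, assuming caret is installed):

```r
library(caret)

# The schemes used by the APM scripts, by method name:
#   "cv"         k-fold cross-validation
#   "repeatedcv" repeated k-fold cross-validation
#   "LOOCV"      leave-one-out cross-validation
#   "LGOCV"      leave-group-out CV (repeated train/test splits)
#   "boot"       bootstrap; "boot632" adds the 0.632 correction
#   "oob"        out-of-bag estimates (tree ensembles only)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(1)
fit <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = ctrl)
fit$control$method  # "repeatedcv"
```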
%%R
APMchapters = c(
"",
"02_A_Short_Tour.R",
"03_Data_Pre_Processing.R",
"04_Over_Fitting.R",
"",
"06_Linear_Regression.R",
"07_Non-Linear_Reg.R",
"08_Regression_Trees.R",
"",
"10_Case_Study_Concrete.R",
"11_Class_Performance.R",
"12_Discriminant_Analysis.R",
"13_Non-Linear_Class.R",
"14_Class_Trees.R",
"",
"16_Class_Imbalance.R",
"17_Job_Scheduling.R",
"18_Importance.R",
"19_Feature_Select.R",
"CreateGrantData.R")
showChapterScript = function(n) {
  # Display the R script for chapter n, if one exists
  if (APMchapters[n] != "")
    file.show( file.path( scriptLocation(), APMchapters[n] ))
}
showChapterOutput = function(n) {
  # Display the saved .Rout transcript for chapter n
  if (APMchapters[n] != "")
    file.show( file.path( scriptLocation(), paste(APMchapters[n], "out", sep="") ))
}
runChapterScript = function(n) {
  # Source (run) the chapter n script, echoing each expression
  if (APMchapters[n] != "")
    source( file.path( scriptLocation(), APMchapters[n] ), echo=TRUE )
}
%%R
showChapterScript(2)
NULL
%%R
# showChapterOutput(2)
NULL
%%R -w 600 -h 600
runChapterScript(2)
## user system elapsed
## 4.971 0.114 5.292
NULL
%%R
# Another way to run the script for Chapter 2:
PATIENT = TRUE
if (PATIENT) {
current_working_directory = getwd() # remember current directory
chapter_code_directory = scriptLocation()
setwd( chapter_code_directory )
print(dir())
print(source("02_A_Short_Tour.R", echo=TRUE))
setwd(current_working_directory) # return to working directory
}
[1] "02_A_Short_Tour.R" "02_A_Short_Tour.Rout"
[3] "03_Data_Pre_Processing.R" "03_Data_Pre_Processing.Rout"
[5] "04_Over_Fitting.R" "04_Over_Fitting.Rout"
[7] "06_Linear_Regression.R" "06_Linear_Regression.Rout"
[9] "07_Non-Linear_Reg.R" "07_Non-Linear_Reg.Rout"
[11] "08_Regression_Trees.R" "08_Regression_Trees.Rout"
[13] "10_Case_Study_Concrete.R" "10_Case_Study_Concrete.Rout"
[15] "11_Class_Performance.R" "11_Class_Performance.Rout"
[17] "12_Discriminant_Analysis.R" "12_Discriminant_Analysis.Rout"
[19] "13_Non-Linear_Class.R" "13_Non-Linear_Class.Rout"
[21] "14_Class_Trees.R" "14_Class_Trees.Rout"
[23] "16_Class_Imbalance.R" "16_Class_Imbalance.Rout"
[25] "17_Job_Scheduling.R" "17_Job_Scheduling.Rout"
[27] "18_Importance.R" "18_Importance.Rout"
[29] "19_Feature_Select.R" "19_Feature_Select.Rout"
[31] "CreateGrantData.R" "CreateGrantData.Rout"
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Jo .... [TRUNCATED]
> data(FuelEconomy)
> ## Format data for plotting against engine displacement
>
> ## Sort by engine displacement
> cars2010 <- cars2010[order(cars2010$EngDispl),]
> cars2011 <- cars2011[order(cars2011$EngDispl),]
> ## Combine data into one data frame
> cars2010a <- cars2010
> cars2010a$Year <- "2010 Model Year"
> cars2011a <- cars2011
> cars2011a$Year <- "2011 Model Year"
> plotData <- rbind(cars2010a, cars2011a)
> library(lattice)
> xyplot(FE ~ EngDispl|Year, plotData,
+ xlab = "Engine Displacement",
+ ylab = "Fuel Efficiency (MPG)",
+ between = list(x = 1.2 .... [TRUNCATED]
> ## Fit a single linear model and conduct 10-fold CV to estimate the error
> library(caret)
> set.seed(1)
> lm1Fit <- train(FE ~ EngDispl,
+ data = cars2010,
+ method = "lm",
+ trControl = trainControl(meth .... [TRUNCATED]
> lm1Fit
Linear Regression
1107 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
4.604285 0.628494 0.492878 0.04418925
> ## Fit a quadratic model too
>
> ## Create squared terms
> cars2010$ED2 <- cars2010$EngDispl^2
> cars2011$ED2 <- cars2011$EngDispl^2
> set.seed(1)
> lm2Fit <- train(FE ~ EngDispl + ED2,
+ data = cars2010,
+ method = "lm",
+ trControl = trainContro .... [TRUNCATED]
> lm2Fit
Linear Regression
1107 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
4.228432 0.6843226 0.4194454 0.04210009
> ## Finally a MARS model (via the earth package)
>
> library(earth)
Loading required package: plotmo
Loading required package: plotrix
> set.seed(1)
> marsFit <- train(FE ~ EngDispl,
+ data = cars2010,
+ method = "earth",
+ tuneLength = 15,
+ .... [TRUNCATED]
> marsFit
Multivariate Adaptive Regression Spline
1107 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ...
Resampling results across tuning parameters:
nprune RMSE Rsquared RMSE SD Rsquared SD
2 4.295551 0.6734579 0.4412493 0.04289014
3 4.255755 0.6802699 0.4403794 0.03947172
4 4.228066 0.6845448 0.4488977 0.04278739
5 4.249977 0.6820430 0.4886947 0.04318735
Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 4 and degree = 1.
> plot(marsFit)
> ## Predict the test set data
> cars2011$lm1 <- predict(lm1Fit, cars2011)
> cars2011$lm2 <- predict(lm2Fit, cars2011)
> cars2011$mars <- predict(marsFit, cars2011)
> ## Get test set performance values via caret's postResample function
>
> postResample(pred = cars2011$lm1, obs = cars2011$FE)
RMSE Rsquared
5.1625309 0.7018642
> postResample(pred = cars2011$lm2, obs = cars2011$FE)
RMSE Rsquared
4.7162853 0.7486074
> postResample(pred = cars2011$mars, obs = cars2011$FE)
RMSE Rsquared
4.6855501 0.7499953
> ################################################################################
> ### Session Information
>
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)
locale:
[1] C
attached base packages:
[1] parallel tools stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] earth_4.2.0 plotrix_3.5-11
[3] plotmo_2.2.1 pROC_1.7.3
[5] doMC_1.3.3 iterators_1.0.7
[7] foreach_1.4.2 AppliedPredictiveModeling_1.1-6
[9] caret_6.0-41 ggplot2_1.0.1
[11] lattice_0.20-31
loaded via a namespace (and not attached):
[1] BradleyTerry2_1.0-6 CORElearn_0.9.45 MASS_7.3-40
[4] Matrix_1.1-5 Rcpp_0.11.5 SparseM_1.6
[7] brglm_0.5-9 car_2.0-25 class_7.3-12
[10] cluster_2.0.1 codetools_0.2-10 colorspace_1.2-6
[13] compiler_3.1.3 digest_0.6.8 e1071_1.6-4
[16] grid_3.1.3 gtable_0.1.2 gtools_3.4.1
[19] lme4_1.1-7 mgcv_1.8-4 minqa_1.2.4
[22] munsell_0.4.2 nlme_3.1-120 nloptr_1.0.4
[25] nnet_7.3-9 pbkrtest_0.4-2 plyr_1.8.1
[28] proto_0.3-10 quantreg_5.11 reshape2_1.4.1
[31] rpart_4.1-9 scales_0.2.4 splines_3.1.3
[34] stringr_0.6.2
> ### q("no")
>
>
%%R
## Another way to run the Chapter 2 script
library(AppliedPredictiveModeling)
data(FuelEconomy)
## Format data for plotting against engine displacement
## Sort by engine displacement
cars2010 <- cars2010[order(cars2010$EngDispl),]
cars2011 <- cars2011[order(cars2011$EngDispl),]
## Combine data into one data frame
cars2010a <- cars2010
cars2010a$Year <- "2010 Model Year"
cars2011a <- cars2011
cars2011a$Year <- "2011 Model Year"
plotData <- rbind(cars2010a, cars2011a)
library(lattice)
print(
xyplot(FE ~ EngDispl|Year, plotData,
xlab = "Engine Displacement",
ylab = "Fuel Efficiency (MPG)",
between = list(x = 1.2))
)
########## 'plot' routines in the lattice package must be wrapped in print() to display their output when run non-interactively!
## Fit a single linear model and conduct 10-fold CV to estimate the error
library(caret)
set.seed(1)
lm1Fit <- train(FE ~ EngDispl,
data = cars2010,
method = "lm",
trControl = trainControl(method= "cv"))
print(lm1Fit)
## Fit a quadratic model too
## Create squared terms
cars2010$ED2 <- cars2010$EngDispl^2
cars2011$ED2 <- cars2011$EngDispl^2
set.seed(1)
lm2Fit <- train(FE ~ EngDispl + ED2,
data = cars2010,
method = "lm",
trControl = trainControl(method= "cv"))
print(lm2Fit)
Linear Regression
1107 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
4.604285 0.628494 0.492878 0.04418925
Linear Regression
1107 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
4.228432 0.6843226 0.4194454 0.04210009
%%R
## Finally a MARS model (via the earth package)
library(earth)
set.seed(1)
marsFit <- train(FE ~ EngDispl,
data = cars2010,
method = "earth",
tuneLength = 15,
trControl = trainControl(method= "cv"))
print(marsFit)
print(plot(marsFit))
Multivariate Adaptive Regression Spline
1107 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 997, 996, 995, 996, 997, 996, ...
Resampling results across tuning parameters:
nprune RMSE Rsquared RMSE SD Rsquared SD
2 4.295551 0.6734579 0.4412493 0.04289014
3 4.255755 0.6802699 0.4403794 0.03947172
4 4.228066 0.6845448 0.4488977 0.04278739
5 4.249977 0.6820430 0.4886947 0.04318735
Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 4 and degree = 1.
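The tuning results above can also be queried programmatically. As a minimal sketch (assuming the `marsFit` object fit in the cell above), caret stores the winning parameter values in the `bestTune` component and the full resampling profile in `results`:

```r
## Inspect the selected tuning parameters of a caret train object
## (bestTune and results are standard components of train objects)
print(marsFit$bestTune)
## The row of the resampling profile corresponding to the winner
print(subset(marsFit$results, nprune == marsFit$bestTune$nprune))
```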
%%R
## Predict the test set data
cars2011$lm1 <- predict(lm1Fit, cars2011)
cars2011$lm2 <- predict(lm2Fit, cars2011)
cars2011$mars <- predict(marsFit, cars2011)
## Get test set performance values via caret's postResample function
print(postResample(pred = cars2011$lm1, obs = cars2011$FE))
print(postResample(pred = cars2011$lm2, obs = cars2011$FE))
print(postResample(pred = cars2011$mars, obs = cars2011$FE))
RMSE Rsquared
5.1625309 0.7018642
RMSE Rsquared
4.7162853 0.7486074
RMSE Rsquared
4.6855501 0.7499953
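The two numbers postResample reports can be reproduced by hand, which makes their definitions concrete. A minimal sketch (assuming the `cars2011` predictions computed in the cell above); note that caret's Rsquared here is the squared correlation between observed and predicted values:

```r
## Root mean squared prediction error for the MARS test-set predictions
rmse <- sqrt(mean((cars2011$FE - cars2011$mars)^2))
## Squared correlation between observed and predicted values
r2 <- cor(cars2011$FE, cars2011$mars)^2
print(c(RMSE = rmse, Rsquared = r2))
```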
%%R
showChapterScript(3)
NULL
%%R
showChapterOutput(3)
NULL
%%R -w 600 -h 600
runChapterScript(3)
## user system elapsed
## 5.791 0.147 6.146
NULL
%%R
### Section 3.1 Case Study: Cell Segmentation in High-Content Screening
library(AppliedPredictiveModeling)
data(segmentationOriginal)
## Retain the original training set
segTrain <- subset(segmentationOriginal, Case == "Train")
## Remove the first three columns (identifier columns)
segTrainX <- segTrain[, -(1:3)]
segTrainClass <- segTrain$Class
print(colnames(segTrain))
print(table(segTrainClass))
[1] "Cell" "Case"
[3] "Class" "AngleCh1"
[5] "AngleStatusCh1" "AreaCh1"
[7] "AreaStatusCh1" "AvgIntenCh1"
[9] "AvgIntenCh2" "AvgIntenCh3"
[11] "AvgIntenCh4" "AvgIntenStatusCh1"
[13] "AvgIntenStatusCh2" "AvgIntenStatusCh3"
[15] "AvgIntenStatusCh4" "ConvexHullAreaRatioCh1"
[17] "ConvexHullAreaRatioStatusCh1" "ConvexHullPerimRatioCh1"
[19] "ConvexHullPerimRatioStatusCh1" "DiffIntenDensityCh1"
[21] "DiffIntenDensityCh3" "DiffIntenDensityCh4"
[23] "DiffIntenDensityStatusCh1" "DiffIntenDensityStatusCh3"
[25] "DiffIntenDensityStatusCh4" "EntropyIntenCh1"
[27] "EntropyIntenCh3" "EntropyIntenCh4"
[29] "EntropyIntenStatusCh1" "EntropyIntenStatusCh3"
[31] "EntropyIntenStatusCh4" "EqCircDiamCh1"
[33] "EqCircDiamStatusCh1" "EqEllipseLWRCh1"
[35] "EqEllipseLWRStatusCh1" "EqEllipseOblateVolCh1"
[37] "EqEllipseOblateVolStatusCh1" "EqEllipseProlateVolCh1"
[39] "EqEllipseProlateVolStatusCh1" "EqSphereAreaCh1"
[41] "EqSphereAreaStatusCh1" "EqSphereVolCh1"
[43] "EqSphereVolStatusCh1" "FiberAlign2Ch3"
[45] "FiberAlign2Ch4" "FiberAlign2StatusCh3"
[47] "FiberAlign2StatusCh4" "FiberLengthCh1"
[49] "FiberLengthStatusCh1" "FiberWidthCh1"
[51] "FiberWidthStatusCh1" "IntenCoocASMCh3"
[53] "IntenCoocASMCh4" "IntenCoocASMStatusCh3"
[55] "IntenCoocASMStatusCh4" "IntenCoocContrastCh3"
[57] "IntenCoocContrastCh4" "IntenCoocContrastStatusCh3"
[59] "IntenCoocContrastStatusCh4" "IntenCoocEntropyCh3"
[61] "IntenCoocEntropyCh4" "IntenCoocEntropyStatusCh3"
[63] "IntenCoocEntropyStatusCh4" "IntenCoocMaxCh3"
[65] "IntenCoocMaxCh4" "IntenCoocMaxStatusCh3"
[67] "IntenCoocMaxStatusCh4" "KurtIntenCh1"
[69] "KurtIntenCh3" "KurtIntenCh4"
[71] "KurtIntenStatusCh1" "KurtIntenStatusCh3"
[73] "KurtIntenStatusCh4" "LengthCh1"
[75] "LengthStatusCh1" "MemberAvgAvgIntenStatusCh2"
[77] "MemberAvgTotalIntenStatusCh2" "NeighborAvgDistCh1"
[79] "NeighborAvgDistStatusCh1" "NeighborMinDistCh1"
[81] "NeighborMinDistStatusCh1" "NeighborVarDistCh1"
[83] "NeighborVarDistStatusCh1" "PerimCh1"
[85] "PerimStatusCh1" "ShapeBFRCh1"
[87] "ShapeBFRStatusCh1" "ShapeLWRCh1"
[89] "ShapeLWRStatusCh1" "ShapeP2ACh1"
[91] "ShapeP2AStatusCh1" "SkewIntenCh1"
[93] "SkewIntenCh3" "SkewIntenCh4"
[95] "SkewIntenStatusCh1" "SkewIntenStatusCh3"
[97] "SkewIntenStatusCh4" "SpotFiberCountCh3"
[99] "SpotFiberCountCh4" "SpotFiberCountStatusCh3"
[101] "SpotFiberCountStatusCh4" "TotalIntenCh1"
[103] "TotalIntenCh2" "TotalIntenCh3"
[105] "TotalIntenCh4" "TotalIntenStatusCh1"
[107] "TotalIntenStatusCh2" "TotalIntenStatusCh3"
[109] "TotalIntenStatusCh4" "VarIntenCh1"
[111] "VarIntenCh3" "VarIntenCh4"
[113] "VarIntenStatusCh1" "VarIntenStatusCh3"
[115] "VarIntenStatusCh4" "WidthCh1"
[117] "WidthStatusCh1" "XCentroid"
[119] "YCentroid"
segTrainClass
PS WS
636 373
%%R
### Section 3.2 Data Transformations for Individual Predictors
## The column VarIntenCh3 measures the standard deviation of the intensity
## of the pixels in the actin filaments
print(max(segTrainX$VarIntenCh3)/min(segTrainX$VarIntenCh3))
library(e1071)
print(skewness(segTrainX$VarIntenCh3))
library(caret)
## Use caret's preProcess function to transform for skewness
segPP <- preProcess(segTrainX, method = "BoxCox")
## Apply the transformations
segTrainTrans <- predict(segPP, segTrainX)
## Results for a single predictor
print(segPP$bc$VarIntenCh3)
print(
histogram(~segTrainX$VarIntenCh3,
xlab = "Natural Units",
type = "count")
)
print(
histogram(~log(segTrainX$VarIntenCh3),
xlab = "Log Units",
ylab = " ",
type = "count")
)
print(
segPP$bc$PerimCh1
)
print(
histogram(~segTrainX$PerimCh1,
xlab = "Natural Units",
type = "count")
)
print(
histogram(~segTrainTrans$PerimCh1,
xlab = "Transformed Data",
ylab = " ",
type = "count")
)
Box-Cox Transformation
1009 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
47.74 64.37 79.02 91.61 103.20 459.80
Largest/Smallest: 9.63
Sample Skewness: 2.59
Estimated Lambda: -1.1
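The transformation preProcess applies with an estimated lambda is the standard Box-Cox form (y^lambda - 1)/lambda (or log(y) when lambda is zero). A minimal sketch of applying it by hand, assuming the `segPP`, `segTrainX`, and `segTrainTrans` objects from the cell above (the lambda is read off the stored BoxCoxTrans object for one predictor):

```r
## Hand-apply the Box-Cox transform for a single predictor and compare
## with preProcess's result; differences should be essentially zero
lambda <- segPP$bc$VarIntenCh3$lambda
y <- segTrainX$VarIntenCh3
yTrans <- if (abs(lambda) > 1e-8) (y^lambda - 1) / lambda else log(y)
print(summary(yTrans - segTrainTrans$VarIntenCh3))
```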
%%R
### Section 3.3 Data Transformations for Multiple Predictors
## R's prcomp is used to conduct PCA
pr <- prcomp(~ AvgIntenCh1 + EntropyIntenCh1,
data = segTrainTrans,
scale. = TRUE)
transparentTheme(pchSize = .7, trans = .3)
print(
xyplot(AvgIntenCh1 ~ EntropyIntenCh1,
data = segTrainTrans,
groups = segTrain$Class,
xlab = "Channel 1 Fiber Width",
ylab = "Intensity Entropy Channel 1",
auto.key = list(columns = 2),
type = c("p", "g"),
main = "Original Data",
aspect = 1)
)
print(
xyplot(PC2 ~ PC1,
data = as.data.frame(pr$x),
groups = segTrain$Class,
xlab = "Principal Component #1",
ylab = "Principal Component #2",
main = "Transformed",
xlim = extendrange(pr$x),
ylim = extendrange(pr$x),
type = c("p", "g"),
aspect = 1)
)
## Apply PCA to the entire set of predictors.
## There are a few predictors with only a single value, so we remove these first
## (since PCA uses variances, which would be zero)
isZV <- apply(segTrainX, 2, function(x) length(unique(x)) == 1)
segTrainX <- segTrainX[, !isZV]
segPP <- preProcess(segTrainX, c("BoxCox", "center", "scale"))
segTrainTrans <- predict(segPP, segTrainX)
segPCA <- prcomp(segTrainTrans, center = TRUE, scale. = TRUE)
## Plot a scatterplot matrix of the first three components
transparentTheme(pchSize = .8, trans = .3)
panelRange <- extendrange(segPCA$x[, 1:3])
print(
splom(as.data.frame(segPCA$x[, 1:3]),
groups = segTrainClass,
type = c("p", "g"),
as.table = TRUE,
auto.key = list(columns = 2),
prepanel.limits = function(x) panelRange)
)
## Format the rotation values for plotting
segRot <- as.data.frame(segPCA$rotation[, 1:3])
## Derive the channel variable
vars <- rownames(segPCA$rotation)
channel <- rep(NA, length(vars))
channel[grepl("Ch1$", vars)] <- "Channel 1"
channel[grepl("Ch2$", vars)] <- "Channel 2"
channel[grepl("Ch3$", vars)] <- "Channel 3"
channel[grepl("Ch4$", vars)] <- "Channel 4"
segRot$Channel <- channel
segRot <- segRot[complete.cases(segRot),]
segRot$Channel <- factor(as.character(segRot$Channel))
## Plot a scatterplot matrix of the first three rotation variables
transparentTheme(pchSize = .8, trans = .7)
panelRange <- extendrange(segRot[, 1:3])
library(ellipse)
upperp <- function(...)
{
args <- list(...)
circ1 <- ellipse(diag(rep(1, 2)), t = .1)
panel.xyplot(circ1[,1], circ1[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
circ2 <- ellipse(diag(rep(1, 2)), t = .2)
panel.xyplot(circ2[,1], circ2[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
circ3 <- ellipse(diag(rep(1, 2)), t = .3)
panel.xyplot(circ3[,1], circ3[,2],
type = "l",
lty = trellis.par.get("reference.line")$lty,
col = trellis.par.get("reference.line")$col,
lwd = trellis.par.get("reference.line")$lwd)
panel.xyplot(args$x, args$y, groups = args$groups, subscripts = args$subscripts)
}
print(
splom(~segRot[, 1:3],
groups = segRot$Channel,
lower.panel = function(...){}, upper.panel = upperp,
prepanel.limits = function(x) panelRange,
auto.key = list(columns = 2))
)
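Before reading the component plots, it helps to know how much of the total variance the leading components actually capture. A minimal sketch using the `segPCA` object fit above (`sdev` is a standard component of prcomp results):

```r
## Proportion of total variance explained by each principal component
pctVar <- segPCA$sdev^2 / sum(segPCA$sdev^2)
## Cumulative share captured by the first five components
print(round(cumsum(pctVar)[1:5], 3))
```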
%%R
### Section 3.5 Removing Variables
## To filter on correlations, we first get the correlation matrix for the
## predictor set
segCorr <- cor(segTrainTrans)
library(corrplot)
corrplot(segCorr, order = "hclust", tl.cex = .35)
## caret's findCorrelation function is used to identify columns to remove.
highCorr <- findCorrelation(segCorr, .75)
print(highCorr)
[1] 85 45 100 13 79 8 19 25 97 71 35 99 5 6 29 39 37 3 17
[20] 105 57 61 49 58 7 62 50 18 89 31 9 102 4 38 34 52 51 108
[39] 40 88 87 22 73
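The indices returned by findCorrelation are meant to be dropped from the predictor set. A minimal sketch continuing from the cell above:

```r
## Drop the highly correlated columns flagged by findCorrelation
segTrainXfiltered <- segTrainTrans[, -highCorr]
print(c(before = ncol(segTrainTrans), after = ncol(segTrainXfiltered)))
## No remaining pairwise correlation should exceed the 0.75 cutoff
corrFiltered <- cor(segTrainXfiltered)
print(max(abs(corrFiltered[upper.tri(corrFiltered)])))
```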
%%R
### Section 3.8 Computing (Creating Dummy Variables)
data(cars)
type <- c("convertible", "coupe", "hatchback", "sedan", "wagon")
cars$Type <- factor(apply(cars[, 14:18], 1, function(x) type[which(x == 1)]))
carSubset <- cars[sample(1:nrow(cars), 20), c(1, 2, 19)]
print(
head(carSubset)
)
print(
levels(carSubset$Type)
)
Price Mileage Type
759 13540.04 17343 sedan
303 18912.98 21512 sedan
765 15623.92 21272 sedan
219 33540.54 20925 convertible
550 22064.29 27384 sedan
110 11903.10 25285 coupe
[1] "convertible" "coupe" "hatchback" "sedan" "wagon"
%%R
simpleMod <- dummyVars(~Mileage + Type,
data = carSubset,
## Remove the variable name from the
## column name
levelsOnly = TRUE)
print(
simpleMod
)
withInteraction <- dummyVars(~Mileage + Type + Mileage:Type,
data = carSubset,
levelsOnly = TRUE)
print(
withInteraction
)
print(
predict(withInteraction, head(carSubset))
)
Dummy Variable Object
Formula: ~Mileage + Type
2 variables, 1 factors
Factor variable names will be removed
A less than full rank encoding is used
Dummy Variable Object
Formula: ~Mileage + Type + Mileage:Type
2 variables, 1 factors
Factor variable names will be removed
A less than full rank encoding is used
Mileage convertible coupe hatchback sedan wagon Mileage:convertible
635 9049 0 0 0 1 0 0
421 17870 0 0 0 1 0 0
257 26700 0 1 0 0 0 0
221 10340 1 0 0 0 0 10340
642 25557 0 0 0 1 0 0
84 13776 0 1 0 0 0 0
Mileage:coupe Mileage:hatchback Mileage:sedan Mileage:wagon
635 0 0 9049 0
421 0 0 17870 0
257 26700 0 0 0
221 0 0 0 0
642 0 0 25557 0
84 13776 0 0 0
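The encodings above are less than full rank: every level of Type gets its own column, so the columns are linearly dependent with an intercept. For models that need a full-rank design matrix, dummyVars accepts a `fullRank` argument. A hedged sketch reusing `carSubset` from the cell above:

```r
## Full-rank encoding: one reference level of Type is dropped,
## as in a conventional regression design matrix
fullRankMod <- dummyVars(~Mileage + Type,
                         data = carSubset,
                         fullRank = TRUE)
print(predict(fullRankMod, head(carSubset)))
```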
%%R
showChapterScript(4)
NULL
%%R
showChapterOutput(4)
NULL
%%R -w 600 -h 600
runChapterScript(4)
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Jo .... [TRUNCATED]
> data(GermanCredit)
> ## First, remove near-zero variance predictors then get rid of a few predictors
> ## that duplicate values. For example, there are two possible val .... [TRUNCATED]
> GermanCredit$CheckingAccountStatus.lt.0 <- NULL
> GermanCredit$SavingsAccountBonds.lt.100 <- NULL
> GermanCredit$EmploymentDuration.lt.1 <- NULL
> GermanCredit$EmploymentDuration.Unemployed <- NULL
> GermanCredit$Personal.Male.Married.Widowed <- NULL
> GermanCredit$Property.Unknown <- NULL
> GermanCredit$Housing.ForFree <- NULL
> ## Split the data into training (80%) and test sets (20%)
> set.seed(100)
> inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
> GermanCreditTrain <- GermanCredit[ inTrain, ]
> GermanCreditTest <- GermanCredit[-inTrain, ]
> ## The model fitting code shown in the computing section is fairly
> ## simplistic. For the text we estimate the tuning parameter grid
> ## up-fron .... [TRUNCATED]
> set.seed(231)
> sigDist <- sigest(Class ~ ., data = GermanCreditTrain, frac = 1)
> svmTuneGrid <- data.frame(sigma = as.vector(sigDist)[1], C = 2^(-2:7))
> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
> ### up .... [TRUNCATED]
> svmFit <- train(Class ~ .,
+ data = GermanCreditTrain,
+ method = "svmRadial",
+ preProc = c("center ..." ... [TRUNCATED]
> ## classProbs = TRUE was added since the text was written
>
> ## Print the results
> svmFit
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.74125 0.3515540 0.05046025 0.1175042
0.50 0.74050 0.3462643 0.05178941 0.1205921
1.00 0.74475 0.3441089 0.05070234 0.1194702
2.00 0.74175 0.3209028 0.04681229 0.1193335
4.00 0.74275 0.3160328 0.04890967 0.1220800
8.00 0.75325 0.3389174 0.04836682 0.1291946
16.00 0.74700 0.3081410 0.04428859 0.1252361
32.00 0.74200 0.2922277 0.04466142 0.1274896
64.00 0.73975 0.2727270 0.04451338 0.1371257
128.00 0.73650 0.2763129 0.04495179 0.1278093
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8.
> ## A line plot of the average performance. The 'scales' argument is actually an
> ## argument to xyplot that converts the x-axis to log-2 units.
> .... [TRUNCATED]
> ## Test set predictions
>
> predictedClasses <- predict(svmFit, GermanCreditTest)
> str(predictedClasses)
Factor w/ 2 levels "Bad","Good": 1 2 2 2 1 2 2 2 1 1 ...
> ## Use the "type" option to get class probabilities
>
> predictedProbs <- predict(svmFit, newdata = GermanCreditTest, type = "prob")
> head(predictedProbs)
Bad Good
1 0.58917636 0.4108236
2 0.49818809 0.5018119
3 0.31073860 0.6892614
4 0.08949224 0.9105078
5 0.60453392 0.3954661
6 0.13487103 0.8651290
> ## Fit the same model using different resampling methods. The main syntax change
> ## is the control object.
>
> set.seed(1056)
> svmFit10CV <- train(Class ~ .,
+ data = GermanCreditTrain,
+ method = "svmRadial",
+ pre .... [TRUNCATED]
> svmFit10CV
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.70000 0.00000000 0.00000000 0.00000000
0.50 0.71875 0.09343326 0.01886539 0.07094452
1.00 0.74375 0.27692135 0.02224391 0.07950763
2.00 0.75875 0.36149069 0.03230175 0.07626079
4.00 0.75500 0.36809516 0.04216370 0.11887279
8.00 0.76125 0.39541476 0.03653860 0.10447322
16.00 0.76625 0.41855404 0.04168749 0.11283531
32.00 0.74875 0.38824618 0.04427267 0.10316210
64.00 0.72875 0.34921040 0.04715886 0.10823541
128.00 0.72875 0.35220213 0.04678927 0.10785380
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 16.
> set.seed(1056)
> svmFitLOO <- train(Class ~ .,
+ data = GermanCreditTrain,
+ method = "svmRadial",
+ preProc .... [TRUNCATED]
> svmFitLOO
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling:
Summary of sample sizes: 799, 799, 799, 799, 799, 799, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.70000 0.0000000
0.50 0.71750 0.1003185
1.00 0.74875 0.3049793
2.00 0.74000 0.3157895
4.00 0.74875 0.3582375
8.00 0.76125 0.4068323
16.00 0.76125 0.4169719
32.00 0.72250 0.3345324
64.00 0.71625 0.3268090
128.00 0.72000 0.3333333
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8.
> set.seed(1056)
> svmFitLGO <- train(Class ~ .,
+ data = GermanCreditTrain,
+ method = "svmRadial",
+ preProc .... [TRUNCATED]
> svmFitLGO
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (50 reps, 0.8%)
Summary of sample sizes: 640, 640, 640, 640, 640, 640, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.700000 0.00000000 0.000000000 0.00000000
0.50 0.711125 0.06691009 0.009557326 0.03877930
1.00 0.737000 0.25887472 0.022440397 0.06320724
2.00 0.740750 0.31816867 0.023765435 0.06074014
4.00 0.743125 0.35076031 0.028071803 0.06804724
8.00 0.745000 0.36985984 0.025222227 0.06174940
16.00 0.738500 0.36501972 0.030445250 0.07631435
32.00 0.729375 0.34893389 0.029646353 0.07227117
64.00 0.721500 0.33509585 0.029346627 0.07130233
128.00 0.714375 0.32063672 0.030389951 0.07486036
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8.
> set.seed(1056)
> svmFitBoot <- train(Class ~ .,
+ data = GermanCreditTrain,
+ method = "svmRadial",
+ pre .... [TRUNCATED]
> svmFitBoot
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Bootstrapped (50 reps)
Summary of sample sizes: 800, 800, 800, 800, 800, 800, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.7040934 0.01896068 0.02637422 0.03273562
0.50 0.7275975 0.18611337 0.03062648 0.08794391
1.00 0.7388778 0.29026235 0.02445672 0.06765864
2.00 0.7420822 0.32895315 0.01767895 0.05040255
4.00 0.7421938 0.34486682 0.01833609 0.04747891
8.00 0.7405316 0.35362257 0.01907557 0.05017752
16.00 0.7349648 0.34738355 0.01916738 0.04500902
32.00 0.7294466 0.34058430 0.02168677 0.04904437
64.00 0.7234922 0.32974005 0.02297203 0.05086115
128.00 0.7209653 0.32439609 0.02321969 0.05087069
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 4.
> set.seed(1056)
> svmFitBoot632 <- train(Class ~ .,
+ data = GermanCreditTrain,
+ method = "svmRadial",
+ .... [TRUNCATED]
> svmFitBoot632
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Bootstrapped (50 reps)
Summary of sample sizes: 800, 800, 800, 800, 800, 800, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.7025875 0.01198544 0.02637422 0.03273562
0.50 0.7330798 0.18856955 0.03062648 0.08794391
1.00 0.7655020 0.35980922 0.02445672 0.06765864
2.00 0.7827026 0.43450963 0.01767895 0.05040255
4.00 0.7979482 0.48754429 0.01833609 0.04747891
8.00 0.8102331 0.52782744 0.01907557 0.05017752
16.00 0.8177506 0.55166437 0.01916738 0.04500902
32.00 0.8229996 0.56881674 0.02168677 0.04904437
64.00 0.8219948 0.56862328 0.02297203 0.05086115
128.00 0.8226967 0.57074712 0.02321969 0.05087069
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 32.
> ################################################################################
> ### Section 4.8 Choosing Between Models
>
> set.seed(1056)
> glmProfile <- train(Class ~ .,
+ data = GermanCreditTrain,
+ method = "glm",
+ trControl .... [TRUNCATED]
> glmProfile
Generalized Linear Model
800 samples
41 predictor
2 classes: 'Bad', 'Good'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.749 0.3647664 0.05162166 0.1218109
> resamp <- resamples(list(SVM = svmFit, Logistic = glmProfile))
> summary(resamp)
Call:
summary.resamples(object = resamp)
Models: SVM, Logistic
Number of resamples: 50
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM 0.6500 0.725 0.7625 0.7532 0.7969 0.8375 0
Logistic 0.6125 0.725 0.7562 0.7490 0.7844 0.8500 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM 0.02778 0.2445 0.3667 0.3389 0.4444 0.5548 0
Logistic 0.07534 0.2831 0.3750 0.3648 0.4504 0.6250 0
> ## These results are slightly different from those shown in the text.
> ## There are some differences in the train() function since the
> ## origin .... [TRUNCATED]
> summary(modelDifferences)
Call:
summary.diff.resamples(object = modelDifferences)
p-value adjustment: bonferroni
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0
Accuracy
SVM Logistic
SVM 0.00425
Logistic 0.4585
Kappa
SVM Logistic
SVM -0.02585
Logistic 0.07948
> ## The actual paired t-test:
> modelDifferences$statistics$Accuracy
$SVM.diff.Logistic
One Sample t-test
data: x
t = 0.7472, df = 49, p-value = 0.4585
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.007179558 0.015679558
sample estimates:
mean of x
0.00425
> ################################################################################
> ### Session Information
>
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)
locale:
[1] C
attached base packages:
[1] parallel tools stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] kernlab_0.9-20 corrplot_0.73
[3] ellipse_0.3-8 e1071_1.6-4
[5] earth_4.2.0 plotrix_3.5-11
[7] plotmo_2.2.1 doMC_1.3.3
[9] iterators_1.0.7 foreach_1.4.2
[11] AppliedPredictiveModeling_1.1-6 caret_6.0-41
[13] ggplot2_1.0.1 lattice_0.20-31
loaded via a namespace (and not attached):
[1] BradleyTerry2_1.0-6 CORElearn_0.9.45 MASS_7.3-40
[4] Matrix_1.1-5 Rcpp_0.11.5 SparseM_1.6
[7] brglm_0.5-9 car_2.0-25 class_7.3-12
[10] cluster_2.0.1 codetools_0.2-10 colorspace_1.2-6
[13] compiler_3.1.3 digest_0.6.8 grid_3.1.3
[16] gtable_0.1.2 gtools_3.4.1 lme4_1.1-7
[19] mgcv_1.8-4 minqa_1.2.4 munsell_0.4.2
[22] nlme_3.1-120 nloptr_1.0.4 nnet_7.3-9
[25] pbkrtest_0.4-2 plyr_1.8.1 proto_0.3-10
[28] quantreg_5.11 reshape2_1.4.1 rpart_4.1-9
[31] scales_0.2.4 splines_3.1.3 stringr_0.6.2
> ### q("no")
>
>
>
%%R
minutes_required_for_previous_script <- 3260.432 / 60
print(minutes_required_for_previous_script)
## user system elapsed
## 3260.432 211.968 906.933
[1] 54.34053
%%R
######## This computation can take five minutes to complete on a single CPU.
### Section 4.6 Choosing Final Tuning Parameters
detach(package:caret) # reload the package, since the code here modifies GermanCredit
library(caret)
data(GermanCredit)
## First, remove near-zero variance predictors then get rid of a few predictors
## that duplicate values. For example, there are three possible values for the
## housing variable: "Rent", "Own" and "ForFree". So that we don't have linear
## dependencies, we get rid of one of the levels (e.g. "ForFree")
GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]
GermanCredit$CheckingAccountStatus.lt.0 <- NULL
GermanCredit$SavingsAccountBonds.lt.100 <- NULL
GermanCredit$EmploymentDuration.lt.1 <- NULL
GermanCredit$EmploymentDuration.Unemployed <- NULL
GermanCredit$Personal.Male.Married.Widowed <- NULL
GermanCredit$Property.Unknown <- NULL
GermanCredit$Housing.ForFree <- NULL
## Split the data into training (80%) and test sets (20%)
set.seed(100)
inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
GermanCreditTrain <- GermanCredit[ inTrain, ]
GermanCreditTest <- GermanCredit[-inTrain, ]
## The model fitting code shown in the computing section is fairly
## simplistic. For the text we estimate the tuning parameter grid
## up-front and pass it in explicitly. This generally is not needed,
## but was used here so that we could trim the cost values to a
## presentable range and to re-use later with different resampling
## methods.
library(kernlab)
set.seed(231)
sigDist <- sigest(Class ~ ., data = GermanCreditTrain, frac = 1)
svmTuneGrid <- data.frame(sigma = as.vector(sigDist)[1], C = 2^(-2:7))
### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.
### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. We
### estimate the memory usage (VSIZE = total memory size) to be
### 2566M/core.
### library(doMC)
### registerDoMC(4)
set.seed(1056)
svmFit <- train(Class ~ .,
data = GermanCreditTrain,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmTuneGrid,
trControl = trainControl(method = "repeatedcv",
repeats = 5,
classProbs = TRUE))
## classProbs = TRUE was added since the text was written
## Print the results
print(
svmFit
)
## A line plot of the average performance. The 'scales' argument is actually an
## argument to xyplot that converts the x-axis to log-2 units.
print(
plot(svmFit, scales = list(x = list(log = 2)))
)
## Test set predictions
predictedClasses <- predict(svmFit, GermanCreditTest)
print(
str(predictedClasses)
)
## Use the "type" option to get class probabilities
predictedProbs <- predict(svmFit, newdata = GermanCreditTest, type = "prob")
print(
head(predictedProbs)
)
Attaching package: 'caret'
The following object is masked from 'package:pls':
R2
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.74125 0.3515540 0.05046025 0.1175042
0.50 0.74050 0.3462643 0.05178941 0.1205921
1.00 0.74475 0.3441089 0.05070234 0.1194702
2.00 0.74175 0.3209028 0.04681229 0.1193335
4.00 0.74275 0.3160328 0.04890967 0.1220800
8.00 0.75325 0.3389174 0.04836682 0.1291946
16.00 0.74700 0.3081410 0.04428859 0.1252361
32.00 0.74200 0.2922277 0.04466142 0.1274896
64.00 0.73975 0.2727270 0.04451338 0.1371257
128.00 0.73650 0.2763129 0.04495179 0.1278093
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8.
Factor w/ 2 levels "Bad","Good": 1 2 2 2 1 2 2 2 1 1 ...
NULL
Bad Good
1 0.58917636 0.4108236
2 0.49818809 0.5018119
3 0.31073860 0.6892614
4 0.08949224 0.9105078
5 0.60453392 0.3954661
6 0.13487103 0.8651290
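%%R
## A minimal sketch, not part of the original script: the predicted
## classes above can be compared against the observed test-set labels
## with caret's confusionMatrix(), which reports accuracy, kappa,
## sensitivity, specificity, and related statistics.

```r
## Assumes predictedClasses and GermanCreditTest exist as created above.
print(confusionMatrix(data = predictedClasses,
                      reference = GermanCreditTest$Class))
```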
%%R
######## This computation can take over a half hour to complete on a single CPU.
## Fit the same model using different resampling methods. The main syntax change
## is the control object.
set.seed(1056)
svmFit10CV <- train(Class ~ .,
data = GermanCreditTrain,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmTuneGrid,
trControl = trainControl(method = "cv", number = 10))
print(
svmFit10CV
)
set.seed(1056)
svmFitLOO <- train(Class ~ .,
data = GermanCreditTrain,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmTuneGrid,
trControl = trainControl(method = "LOOCV"))
print(
svmFitLOO
)
set.seed(1056)
svmFitLGO <- train(Class ~ .,
data = GermanCreditTrain,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmTuneGrid,
trControl = trainControl(method = "LGOCV",
number = 50,
p = .8))
print(
svmFitLGO
)
set.seed(1056)
svmFitBoot <- train(Class ~ .,
data = GermanCreditTrain,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmTuneGrid,
trControl = trainControl(method = "boot", number = 50))
print(
svmFitBoot
)
set.seed(1056)
svmFitBoot632 <- train(Class ~ .,
data = GermanCreditTrain,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmTuneGrid,
trControl = trainControl(method = "boot632",
number = 50))
print(
svmFitBoot632
)
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.70000 0.00000000 0.00000000 0.00000000
0.50 0.71875 0.09343326 0.01886539 0.07094452
1.00 0.74375 0.27692135 0.02224391 0.07950763
2.00 0.75875 0.36149069 0.03230175 0.07626079
4.00 0.75500 0.36809516 0.04216370 0.11887279
8.00 0.76125 0.39541476 0.03653860 0.10447322
16.00 0.76625 0.41855404 0.04168749 0.11283531
32.00 0.74875 0.38824618 0.04427267 0.10316210
64.00 0.72875 0.34921040 0.04715886 0.10823541
128.00 0.72875 0.35220213 0.04678927 0.10785380
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 16.
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling:
Summary of sample sizes: 799, 799, 799, 799, 799, 799, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.70000 0.0000000
0.50 0.71750 0.1003185
1.00 0.74875 0.3049793
2.00 0.74000 0.3157895
4.00 0.74875 0.3582375
8.00 0.76125 0.4068323
16.00 0.76125 0.4169719
32.00 0.72250 0.3345324
64.00 0.71625 0.3268090
128.00 0.72000 0.3333333
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8.
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (50 reps, 0.8%)
Summary of sample sizes: 640, 640, 640, 640, 640, 640, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.700000 0.00000000 0.000000000 0.00000000
0.50 0.711125 0.06691009 0.009557326 0.03877930
1.00 0.737000 0.25887472 0.022440397 0.06320724
2.00 0.740750 0.31816867 0.023765435 0.06074014
4.00 0.743125 0.35076031 0.028071803 0.06804724
8.00 0.745000 0.36985984 0.025222227 0.06174940
16.00 0.738500 0.36501972 0.030445250 0.07631435
32.00 0.729375 0.34893389 0.029646353 0.07227117
64.00 0.721500 0.33509585 0.029346627 0.07130233
128.00 0.714375 0.32063672 0.030389951 0.07486036
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 8.
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Bootstrapped (50 reps)
Summary of sample sizes: 800, 800, 800, 800, 800, 800, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.7040934 0.01896068 0.02637422 0.03273562
0.50 0.7275975 0.18611337 0.03062648 0.08794391
1.00 0.7388778 0.29026235 0.02445672 0.06765864
2.00 0.7420822 0.32895315 0.01767895 0.05040255
4.00 0.7421938 0.34486682 0.01833609 0.04747891
8.00 0.7405316 0.35362257 0.01907557 0.05017752
16.00 0.7349648 0.34738355 0.01916738 0.04500902
32.00 0.7294466 0.34058430 0.02168677 0.04904437
64.00 0.7234922 0.32974005 0.02297203 0.05086115
128.00 0.7209653 0.32439609 0.02321969 0.05087069
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 4.
Support Vector Machines with Radial Basis Function Kernel
800 samples
41 predictor
2 classes: 'Bad', 'Good'
Pre-processing: centered, scaled
Resampling: Bootstrapped (50 reps)
Summary of sample sizes: 800, 800, 800, 800, 800, 800, ...
Resampling results across tuning parameters:
C Accuracy Kappa Accuracy SD Kappa SD
0.25 0.7025875 0.01198544 0.02637422 0.03273562
0.50 0.7330798 0.18856955 0.03062648 0.08794391
1.00 0.7655020 0.35980922 0.02445672 0.06765864
2.00 0.7827026 0.43450963 0.01767895 0.05040255
4.00 0.7979482 0.48754429 0.01833609 0.04747891
8.00 0.8102331 0.52782744 0.01907557 0.05017752
16.00 0.8177506 0.55166437 0.01916738 0.04500902
32.00 0.8229996 0.56881674 0.02168677 0.04904437
64.00 0.8219948 0.56862328 0.02297203 0.05086115
128.00 0.8226967 0.57074712 0.02321969 0.05087069
Tuning parameter 'sigma' was held constant at a value of 0.008918477
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.008918477 and C = 32.
%%R
### Section 4.8 Choosing Between Models
set.seed(1056)
glmProfile <- train(Class ~ .,
data = GermanCreditTrain,
method = "glm",
trControl = trainControl(method = "repeatedcv",
repeats = 5))
print(
glmProfile
)
resamp <- resamples(list(SVM = svmFit, Logistic = glmProfile))
print(
summary(resamp)
)
## These results are slightly different from those shown in the text.
## There are some differences in the train() function since the
## original results were produced. This is due to a difference in
## predictions from the ksvm() function when class probs are requested
## and when they are not. See, for example,
## https://stat.ethz.ch/pipermail/r-help/2013-November/363188.html
modelDifferences <- diff(resamp)
print(
summary(modelDifferences)
)
## The actual paired t-test:
print(
modelDifferences$statistics$Accuracy
)
Generalized Linear Model
800 samples
41 predictor
2 classes: 'Bad', 'Good'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.749 0.3647664 0.05162166 0.1218109
Call:
summary.resamples(object = resamp)
Models: SVM, Logistic
Number of resamples: 50
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM 0.6500 0.725 0.7625 0.7532 0.7969 0.8375 0
Logistic 0.6125 0.725 0.7562 0.7490 0.7844 0.8500 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
SVM 0.02778 0.2445 0.3667 0.3389 0.4444 0.5548 0
Logistic 0.07534 0.2831 0.3750 0.3648 0.4504 0.6250 0
Call:
summary.diff.resamples(object = modelDifferences)
p-value adjustment: bonferroni
Upper diagonal: estimates of the difference
Lower diagonal: p-value for H0: difference = 0
Accuracy
SVM Logistic
SVM 0.00425
Logistic 0.4585
Kappa
SVM Logistic
SVM -0.02585
Logistic 0.07948
$SVM.diff.Logistic
One Sample t-test
data: x
t = 0.7472, df = 49, p-value = 0.4585
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.007179558 0.015679558
sample estimates:
mean of x
0.00425
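%%R
## A minimal sketch, not part of the original script: beyond the paired
## t-test, caret provides lattice-based plot methods for resamples
## objects, which make the resampling distributions of the two models
## easy to compare visually. Assumes 'resamp' from above.

```r
## Box-and-whisker and dot plots of the resampled accuracy per model.
print(bwplot(resamp, metric = "Accuracy"))
print(dotplot(resamp, metric = "Accuracy"))
```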
%%R
showChapterScript(6)
NULL
%%R
showChapterOutput(6)
NULL
%%R -w 600 -h 600
runChapterScript(6)
## user system elapsed
## 540.993 74.917 615.942
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Jo .... [TRUNCATED]
> data(solubility)
> library(lattice)
> ### Some initial plots of the data
>
> xyplot(solTrainY ~ solTrainX$MolWeight, type = c("p", "g"),
+ ylab = "Solubility (log)",
+ mai .... [TRUNCATED]
> xyplot(solTrainY ~ solTrainX$NumRotBonds, type = c("p", "g"),
+ ylab = "Solubility (log)",
+ xlab = "Number of Rotatable Bonds")
> bwplot(solTrainY ~ ifelse(solTrainX[,100] == 1,
+ "structure present",
+ "structure absent"),
.... [TRUNCATED]
> ### Find the columns that are not fingerprints (i.e. the continuous
> ### predictors). grep will return a list of integers corresponding to
> ### co .... [TRUNCATED]
> library(caret)
> featurePlot(solTrainXtrans[, -notFingerprints],
+ solTrainY,
+ between = list(x = 1, y = 1),
+ type = c("g", "p" .... [TRUNCATED]
> library(corrplot)
> ### We used the full namespace to call this function because the pls
> ### package (also used in this chapter) has a function with the same
> ### na .... [TRUNCATED]
> ################################################################################
> ### Section 6.2 Linear Regression
>
> ### Create a control funct .... [TRUNCATED]
> indx <- createFolds(solTrainY, returnTrain = TRUE)
> ctrl <- trainControl(method = "cv", index = indx)
> ### Linear regression model with all of the predictors. This will
> ### produce some warnings that a 'rank-deficient fit may be
> ### misleading'. T .... [TRUNCATED]
> lmTune0 <- train(x = solTrainXtrans, y = solTrainY,
+ method = "lm",
+ trControl = ctrl)
> lmTune0
Linear Regression
951 samples
228 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
0.7210355 0.8768359 0.06998223 0.02467069
> ### And another using a set of predictors reduced by unsupervised
> ### filtering. We apply a filter to reduce extreme between-predictor
> ### corre .... [TRUNCATED]
> trainXfiltered <- solTrainXtrans[, -tooHigh]
> testXfiltered <- solTestXtrans[, -tooHigh]
> set.seed(100)
> lmTune <- train(x = trainXfiltered, y = solTrainY,
+ method = "lm",
+ trControl = ctrl)
> lmTune
Linear Regression
951 samples
190 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
0.7113935 0.8793396 0.06320545 0.02434305
> ### Save the test set results in a data frame
> testResults <- data.frame(obs = solTestY,
+ Linear_Regres .... [TRUNCATED]
> ################################################################################
> ### Section 6.3 Partial Least Squares
>
> ## Run PLS and PCR on .... [TRUNCATED]
> plsTune <- train(x = solTrainXtrans, y = solTrainY,
+ method = "pls",
+ tuneGrid = expand.grid(ncomp = 1:20),
+ .... [TRUNCATED]
Loading required package: pls
Attaching package: 'pls'
The following object is masked from 'package:corrplot':
corrplot
The following object is masked from 'package:caret':
R2
The following object is masked from 'package:stats':
loadings
> plsTune
Partial Least Squares
951 samples
228 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared RMSE SD Rsquared SD
1 1.7543811 0.2630495 0.08396462 0.06500848
2 1.2720647 0.6128490 0.07938883 0.05345622
3 1.0373646 0.7432147 0.07155432 0.02761174
4 0.8370618 0.8317217 0.05615036 0.02574808
5 0.7458318 0.8660461 0.03778846 0.01932122
6 0.7106591 0.8779019 0.03432693 0.02281696
7 0.6921293 0.8841448 0.03794937 0.02403533
8 0.6908481 0.8851647 0.03282238 0.01967729
9 0.6828771 0.8877056 0.02910576 0.01851863
10 0.6824521 0.8879195 0.03050242 0.01870212
11 0.6826719 0.8878955 0.02914169 0.01953986
12 0.6847473 0.8872488 0.03726823 0.01936983
13 0.6836698 0.8875568 0.03972887 0.01935437
14 0.6856134 0.8871389 0.03984337 0.01855409
15 0.6867190 0.8869351 0.04224044 0.01944079
16 0.6860797 0.8872705 0.04359318 0.02079411
17 0.6881636 0.8866078 0.04626247 0.02130103
18 0.6926077 0.8853743 0.04810637 0.02213141
19 0.6943936 0.8848611 0.04858541 0.02206531
20 0.6977396 0.8837453 0.05295825 0.02247232
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 10.
> testResults$PLS <- predict(plsTune, solTestXtrans)
> set.seed(100)
> pcrTune <- train(x = solTrainXtrans, y = solTrainY,
+ method = "pcr",
+ tuneGrid = expand.grid(ncomp = 1:35),
+ .... [TRUNCATED]
> pcrTune
Principal Component Analysis
951 samples
228 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared RMSE SD Rsquared SD
1 1.9778920 0.06590758 0.11043847 0.03465612
2 1.6379400 0.36202127 0.09825075 0.08480717
3 1.3655645 0.55546442 0.09395858 0.04528156
4 1.3715028 0.55157507 0.09810878 0.04757889
5 1.3415864 0.57099834 0.10467614 0.06166222
6 1.2081745 0.64973828 0.08788513 0.06148380
7 1.1818622 0.66578017 0.10108519 0.06050609
8 1.1452119 0.68759737 0.07782801 0.04078188
9 1.0495852 0.73655117 0.08201882 0.03697880
10 1.0063822 0.75723962 0.09589129 0.04169283
11 0.9723334 0.77443568 0.07775156 0.02843482
12 0.9692845 0.77566291 0.07887512 0.02905775
13 0.9526792 0.78316647 0.07637597 0.02724077
14 0.9396590 0.78895459 0.07056722 0.02444445
15 0.9419390 0.78796957 0.06837934 0.02414867
16 0.8695211 0.81842614 0.04668856 0.02511778
17 0.8699482 0.81825536 0.04575858 0.02485892
18 0.8719274 0.81723654 0.04753794 0.02576886
19 0.8695726 0.81824845 0.04727016 0.02659831
20 0.8682556 0.81894961 0.04730875 0.02681389
21 0.8096228 0.84189134 0.04576547 0.02447005
22 0.8122517 0.84082141 0.04477924 0.02426518
23 0.8093641 0.84200427 0.04457044 0.02513324
24 0.8096163 0.84210474 0.04011203 0.02327652
25 0.8095766 0.84208293 0.03900307 0.02355872
26 0.8049366 0.84421798 0.03676154 0.02129394
27 0.8039803 0.84465744 0.03378393 0.02036649
28 0.8056953 0.84397657 0.03395966 0.02100737
29 0.7863312 0.85146390 0.03603401 0.01889728
30 0.7819408 0.85271068 0.03068473 0.02057117
31 0.7795830 0.85355495 0.02832846 0.02096832
32 0.7757032 0.85503975 0.03571378 0.02166955
33 0.7395733 0.86853408 0.03063334 0.01813624
34 0.7327021 0.87065692 0.03102043 0.02117680
35 0.7307134 0.87142813 0.03570471 0.02195190
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 35.
> plsResamples <- plsTune$results
> plsResamples$Model <- "PLS"
> pcrResamples <- pcrTune$results
> pcrResamples$Model <- "PCR"
> plsPlotData <- rbind(plsResamples, pcrResamples)
> xyplot(RMSE ~ ncomp,
+ data = plsPlotData,
+ #aspect = 1,
+ xlab = "# Components",
+ ylab = "RMSE (Cross-Validation)",
+ .... [TRUNCATED]
> plsImp <- varImp(plsTune, scale = FALSE)
> plot(plsImp, top = 25, scales = list(y = list(cex = .95)))
> ################################################################################
> ### Section 6.4 Penalized Models
>
> ## The text used the elasti .... [TRUNCATED]
> set.seed(100)
> ridgeTune <- train(x = solTrainXtrans, y = solTrainY,
+ method = "ridge",
+ tuneGrid = ridgeGrid,
+ .... [TRUNCATED]
Loading required package: elasticnet
Loading required package: lars
Loaded lars 1.2
> ridgeTune
Ridge Regression
951 samples
228 predictors
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared RMSE SD Rsquared SD
0.000000000 0.7207117 0.8769717 0.06994063 0.02450628
0.007142857 0.7047552 0.8818659 0.04495581 0.01988253
0.014285714 0.6964731 0.8847911 0.04051497 0.01867276
0.021428571 0.6925923 0.8862699 0.03781419 0.01797165
0.028571429 0.6908607 0.8870609 0.03593594 0.01748178
0.035714286 0.6904220 0.8874561 0.03457159 0.01710886
0.042857143 0.6908548 0.8875998 0.03357310 0.01681167
0.050000000 0.6919207 0.8875741 0.03285297 0.01656815
0.057142857 0.6934783 0.8874278 0.03234969 0.01636278
0.064285714 0.6954114 0.8872009 0.03202921 0.01619286
0.071428571 0.6976723 0.8869096 0.03185067 0.01604581
0.078571429 0.7002069 0.8865723 0.03179153 0.01591906
0.085714286 0.7029801 0.8862009 0.03183151 0.01580906
0.092857143 0.7059656 0.8858041 0.03195417 0.01571305
0.100000000 0.7091432 0.8853885 0.03214610 0.01562886
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.03571429.
> print(update(plot(ridgeTune), xlab = "Penalty"))
> enetGrid <- expand.grid(lambda = c(0, 0.01, .1),
+ fraction = seq(.05, 1, length = 20))
> set.seed(100)
> enetTune <- train(x = solTrainXtrans, y = solTrainY,
+ method = "enet",
+ tuneGrid = enetGrid,
+ .... [TRUNCATED]
> enetTune
Elasticnet
951 samples
228 predictors
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:
lambda fraction RMSE Rsquared RMSE SD Rsquared SD
0.00 0.05 0.8713747 0.8337289 0.03816148 0.02737681
0.00 0.10 0.6882637 0.8858786 0.04298815 0.02064030
0.00 0.15 0.6729264 0.8907993 0.03942228 0.01837582
0.00 0.20 0.6754697 0.8903865 0.03807506 0.01760700
0.00 0.25 0.6879252 0.8865202 0.04383623 0.01946378
0.00 0.30 0.6971062 0.8836414 0.04812788 0.02058289
0.00 0.35 0.7062274 0.8808469 0.05191262 0.02155822
0.00 0.40 0.7125900 0.8788942 0.05345207 0.02192952
0.00 0.45 0.7138742 0.8785588 0.05342746 0.02178996
0.00 0.50 0.7141235 0.8785622 0.05461747 0.02183522
0.00 0.55 0.7144669 0.8784961 0.05583323 0.02211744
0.00 0.60 0.7140532 0.8786593 0.05739702 0.02234513
0.00 0.65 0.7140599 0.8786880 0.05941448 0.02265512
0.00 0.70 0.7145464 0.8785744 0.06116481 0.02298579
0.00 0.75 0.7151011 0.8784348 0.06289926 0.02335653
0.00 0.80 0.7158067 0.8782629 0.06453350 0.02366829
0.00 0.85 0.7167918 0.8780158 0.06564865 0.02383283
0.00 0.90 0.7178711 0.8777467 0.06672370 0.02398923
0.00 0.95 0.7191448 0.8774055 0.06834509 0.02424302
0.00 1.00 0.7207117 0.8769717 0.06994063 0.02450628
0.01 0.05 1.5168857 0.6435177 0.11013983 0.07875588
0.01 0.10 1.1324481 0.7671388 0.07499369 0.04771971
0.01 0.15 0.9061843 0.8241043 0.05601707 0.02997353
0.01 0.20 0.7855269 0.8571170 0.04929439 0.02173949
0.01 0.25 0.7296380 0.8733531 0.04166558 0.02066970
0.01 0.30 0.6989522 0.8826020 0.04255257 0.02028148
0.01 0.35 0.6866513 0.8863490 0.04212287 0.01967040
0.01 0.40 0.6806730 0.8884346 0.03999669 0.01852187
0.01 0.45 0.6778780 0.8895285 0.03610764 0.01717676
0.01 0.50 0.6760780 0.8902871 0.03307570 0.01620142
0.01 0.55 0.6743998 0.8909724 0.03065386 0.01569024
0.01 0.60 0.6746777 0.8910026 0.03042481 0.01580700
0.01 0.65 0.6765522 0.8904906 0.03177438 0.01642381
0.01 0.70 0.6796775 0.8895768 0.03364893 0.01711767
0.01 0.75 0.6829651 0.8886182 0.03551058 0.01757998
0.01 0.80 0.6862396 0.8876472 0.03719803 0.01791970
0.01 0.85 0.6895735 0.8866477 0.03885651 0.01822379
0.01 0.90 0.6930103 0.8856210 0.04047457 0.01858065
0.01 0.95 0.6968398 0.8844630 0.04181671 0.01895729
0.01 1.00 0.7006283 0.8833050 0.04284382 0.01929610
0.10 0.05 1.6867967 0.5157969 0.13154407 0.08882307
0.10 0.10 1.4058744 0.6954146 0.10735405 0.06584337
0.10 0.15 1.1697385 0.7596795 0.08648027 0.04623881
0.10 0.20 1.0082617 0.7880698 0.06594126 0.03758966
0.10 0.25 0.8950440 0.8218825 0.05827006 0.02812113
0.10 0.30 0.8193443 0.8435444 0.05167792 0.02222192
0.10 0.35 0.7744593 0.8570276 0.04722049 0.02081488
0.10 0.40 0.7519611 0.8644826 0.04182081 0.01957350
0.10 0.45 0.7343282 0.8710631 0.03806132 0.01874198
0.10 0.50 0.7245543 0.8750318 0.03539926 0.01842909
0.10 0.55 0.7180823 0.8778937 0.03288742 0.01794844
0.10 0.60 0.7137901 0.8799906 0.03184857 0.01756183
0.10 0.65 0.7110967 0.8815343 0.03100037 0.01695475
0.10 0.70 0.7104058 0.8823940 0.02973462 0.01635597
0.10 0.75 0.7103284 0.8829674 0.02952904 0.01597719
0.10 0.80 0.7097899 0.8836319 0.03000022 0.01578241
0.10 0.85 0.7093246 0.8842290 0.03064013 0.01567030
0.10 0.90 0.7094949 0.8845954 0.03109508 0.01554030
0.10 0.95 0.7094181 0.8849823 0.03169989 0.01554197
0.10 1.00 0.7091432 0.8853885 0.03214610 0.01562886
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.15 and lambda = 0.
> plot(enetTune)
> testResults$Enet <- predict(enetTune, solTestXtrans)
> ################################################################################
> ### Session Information
>
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)
locale:
[1] C
attached base packages:
[1] parallel tools stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] elasticnet_1.1 lars_1.2
[3] pls_2.4-3 kernlab_0.9-20
[5] corrplot_0.73 ellipse_0.3-8
[7] e1071_1.6-4 earth_4.2.0
[9] plotrix_3.5-11 plotmo_2.2.1
[11] doMC_1.3.3 iterators_1.0.7
[13] foreach_1.4.2 AppliedPredictiveModeling_1.1-6
[15] caret_6.0-41 ggplot2_1.0.1
[17] lattice_0.20-31
loaded via a namespace (and not attached):
[1] BradleyTerry2_1.0-6 CORElearn_0.9.45 MASS_7.3-40
[4] Matrix_1.1-5 Rcpp_0.11.5 SparseM_1.6
[7] brglm_0.5-9 car_2.0-25 class_7.3-12
[10] cluster_2.0.1 codetools_0.2-10 colorspace_1.2-6
[13] compiler_3.1.3 digest_0.6.8 grid_3.1.3
[16] gtable_0.1.2 gtools_3.4.1 lme4_1.1-7
[19] mgcv_1.8-4 minqa_1.2.4 munsell_0.4.2
[22] nlme_3.1-120 nloptr_1.0.4 nnet_7.3-9
[25] pbkrtest_0.4-2 plyr_1.8.1 proto_0.3-10
[28] quantreg_5.11 reshape2_1.4.1 rpart_4.1-9
[31] scales_0.2.4 splines_3.1.3 stringr_0.6.2
> ### q("no")
>
>
>
%%R
### Section 6.1 Case Study: Quantitative Structure-Activity
### Relationship Modeling
library(AppliedPredictiveModeling)
data(solubility)
library(lattice)
### Some initial plots of the data
print(
xyplot(solTrainY ~ solTrainX$MolWeight, type = c("p", "g"),
ylab = "Solubility (log)",
main = "(a)",
xlab = "Molecular Weight")
)
print(
xyplot(solTrainY ~ solTrainX$NumRotBonds, type = c("p", "g"),
ylab = "Solubility (log)",
xlab = "Number of Rotatable Bonds")
)
print(
bwplot(solTrainY ~ ifelse(solTrainX[,100] == 1,
"structure present",
"structure absent"),
ylab = "Solubility (log)",
main = "(b)",
horizontal = FALSE)
)
%%R
### Find the columns that are not fingerprints (i.e., the continuous
### predictors). grep will return a vector of integers corresponding to
### column names that contain the pattern "FP".
notFingerprints <- grep("FP", names(solTrainXtrans))
library(caret)
print(
featurePlot(solTrainXtrans[, -notFingerprints],
solTrainY,
between = list(x = 1, y = 1),
type = c("g", "p", "smooth"),
labels = rep("", 2))
)
%%R
library(corrplot)
### We used the full namespace to call this function because the pls
### package (also used in this chapter) has a function with the same
### name.
corrplot::corrplot(cor(solTrainXtrans[, -notFingerprints]),
order = "hclust",
tl.cex = .8)
%%R
### Section 6.2 Linear Regression
### Create a control function that will be used across models. We
### create the fold assignments explicitly instead of relying on the
### random number seed being set to identical values.
set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
### Linear regression model with all of the predictors. This will
### produce some warnings that a 'rank-deficient fit may be
### misleading'. This happens because the predictors are so highly
### correlated that the design matrix is effectively rank-deficient.
set.seed(100)
lmTune0 <- train(x = solTrainXtrans, y = solTrainY,
method = "lm",
trControl = ctrl)
print(
lmTune0
)
### And another using a set of predictors reduced by unsupervised
### filtering. We apply a filter to reduce extreme between-predictor
### correlations. Note the lack of warnings.
tooHigh <- findCorrelation(cor(solTrainXtrans), .9)
trainXfiltered <- solTrainXtrans[, -tooHigh]
testXfiltered <- solTestXtrans[, -tooHigh]
set.seed(100)
lmTune <- train(x = trainXfiltered, y = solTrainY,
method = "lm",
trControl = ctrl)
print(
lmTune
)
### Save the test set results in a data frame
testResults <- data.frame(obs = solTestY,
Linear_Regression = predict(lmTune, testXfiltered))
Linear Regression
951 samples
228 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
0.7210355 0.8768359 0.06998223 0.02467069
Linear Regression
951 samples
190 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
0.7113935 0.8793396 0.06320545 0.02434305
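%%R
## A minimal sketch, not part of the original script: with the observed
## and predicted values collected in testResults, held-out performance
## can be computed directly with caret's postResample(), which returns
## the test-set RMSE and R-squared.

```r
## Assumes testResults was built as above.
print(postResample(pred = testResults$Linear_Regression,
                   obs = testResults$obs))
```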
%%R
### Section 6.3 Partial Least Squares
## Run PLS and PCR on solubility data and compare results
set.seed(100)
plsTune <- train(x = solTrainXtrans, y = solTrainY,
method = "pls",
tuneGrid = expand.grid(ncomp = 1:20),
trControl = ctrl)
print(
plsTune
)
testResults$PLS <- predict(plsTune, solTestXtrans)
set.seed(100)
pcrTune <- train(x = solTrainXtrans, y = solTrainY,
method = "pcr",
tuneGrid = expand.grid(ncomp = 1:35),
trControl = ctrl)
print(
pcrTune
)
Partial Least Squares
951 samples
228 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared RMSE SD Rsquared SD
1 1.7543811 0.2630495 0.08396462 0.06500848
2 1.2720647 0.6128490 0.07938883 0.05345622
3 1.0373646 0.7432147 0.07155432 0.02761174
4 0.8370618 0.8317217 0.05615036 0.02574808
5 0.7458318 0.8660461 0.03778846 0.01932122
6 0.7106591 0.8779019 0.03432693 0.02281696
7 0.6921293 0.8841448 0.03794937 0.02403533
8 0.6908481 0.8851647 0.03282238 0.01967729
9 0.6828771 0.8877056 0.02910576 0.01851863
10 0.6824521 0.8879195 0.03050242 0.01870212
11 0.6826719 0.8878955 0.02914169 0.01953986
12 0.6847473 0.8872488 0.03726823 0.01936983
13 0.6836698 0.8875568 0.03972887 0.01935437
14 0.6856134 0.8871389 0.03984337 0.01855409
15 0.6867190 0.8869351 0.04224044 0.01944079
16 0.6860797 0.8872705 0.04359318 0.02079411
17 0.6881636 0.8866078 0.04626247 0.02130103
18 0.6926077 0.8853743 0.04810637 0.02213141
19 0.6943936 0.8848611 0.04858541 0.02206531
20 0.6977396 0.8837453 0.05295825 0.02247232
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 10.
Principal Component Analysis
951 samples
228 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared RMSE SD Rsquared SD
1 1.9778920 0.06590758 0.11043847 0.03465612
2 1.6379400 0.36202127 0.09825075 0.08480717
3 1.3655645 0.55546442 0.09395858 0.04528156
4 1.3715028 0.55157507 0.09810878 0.04757889
5 1.3415864 0.57099834 0.10467614 0.06166222
6 1.2081745 0.64973828 0.08788513 0.06148380
7 1.1818622 0.66578017 0.10108519 0.06050609
8 1.1452119 0.68759737 0.07782801 0.04078188
9 1.0495852 0.73655117 0.08201882 0.03697880
10 1.0063822 0.75723962 0.09589129 0.04169283
11 0.9723334 0.77443568 0.07775156 0.02843482
12 0.9692845 0.77566291 0.07887512 0.02905775
13 0.9526792 0.78316647 0.07637597 0.02724077
14 0.9396590 0.78895459 0.07056722 0.02444445
15 0.9419390 0.78796957 0.06837934 0.02414867
16 0.8695211 0.81842614 0.04668856 0.02511778
17 0.8699482 0.81825536 0.04575858 0.02485892
18 0.8719274 0.81723654 0.04753794 0.02576886
19 0.8695726 0.81824845 0.04727016 0.02659831
20 0.8682556 0.81894961 0.04730875 0.02681389
21 0.8096228 0.84189134 0.04576547 0.02447005
22 0.8122517 0.84082141 0.04477924 0.02426518
23 0.8093641 0.84200427 0.04457044 0.02513324
24 0.8096163 0.84210474 0.04011203 0.02327652
25 0.8095766 0.84208293 0.03900307 0.02355872
26 0.8049366 0.84421798 0.03676154 0.02129394
27 0.8039803 0.84465744 0.03378393 0.02036649
28 0.8056953 0.84397657 0.03395966 0.02100737
29 0.7863312 0.85146390 0.03603401 0.01889728
30 0.7819408 0.85271068 0.03068473 0.02057117
31 0.7795830 0.85355495 0.02832846 0.02096832
32 0.7757032 0.85503975 0.03571378 0.02166955
33 0.7395733 0.86853408 0.03063334 0.01813624
34 0.7327021 0.87065692 0.03102043 0.02117680
35 0.7307134 0.87142813 0.03570471 0.02195190
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 35.
%%R
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
plsPlotData <- rbind(plsResamples, pcrResamples)
print(
xyplot(RMSE ~ ncomp,
data = plsPlotData,
#aspect = 1,
xlab = "# Components",
ylab = "RMSE (Cross-Validation)",
auto.key = list(columns = 2),
groups = Model,
type = c("o", "g"))
)
plsImp <- varImp(plsTune, scale = FALSE)
plot(plsImp, top = 25, scales = list(y = list(cex = .95)))
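%%R
## A minimal sketch, not part of the original script: the variable
## importance scores plotted above can also be listed numerically.
## Assumes 'plsImp' from above; for a regression model, plsImp$importance
## is a data frame with one row per predictor and an 'Overall' column.

```r
imp <- plsImp$importance
print(head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 10))
```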
%%R
### Section 6.4 Penalized Models
## The text used the elasticnet package to obtain a ridge regression
## model; there is now a simple ridge regression method.
ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 15))
set.seed(100)
ridgeTune <- train(x = solTrainXtrans, y = solTrainY,
method = "ridge",
tuneGrid = ridgeGrid,
trControl = ctrl,
preProc = c("center", "scale"))
print(
ridgeTune
)
print(update(plot(ridgeTune), xlab = "Penalty"))
Loading required package: elasticnet
Loading required package: lars
Loaded lars 1.2
Ridge Regression
951 samples
228 predictors
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared RMSE SD Rsquared SD
0.000000000 0.7207117 0.8769717 0.06994063 0.02450628
0.007142857 0.7047552 0.8818659 0.04495581 0.01988253
0.014285714 0.6964731 0.8847911 0.04051497 0.01867276
0.021428571 0.6925923 0.8862699 0.03781419 0.01797165
0.028571429 0.6908607 0.8870609 0.03593594 0.01748178
0.035714286 0.6904220 0.8874561 0.03457159 0.01710886
0.042857143 0.6908548 0.8875998 0.03357310 0.01681167
0.050000000 0.6919207 0.8875741 0.03285297 0.01656815
0.057142857 0.6934783 0.8874278 0.03234969 0.01636278
0.064285714 0.6954114 0.8872009 0.03202921 0.01619286
0.071428571 0.6976723 0.8869096 0.03185067 0.01604581
0.078571429 0.7002069 0.8865723 0.03179153 0.01591906
0.085714286 0.7029801 0.8862009 0.03183151 0.01580906
0.092857143 0.7059656 0.8858041 0.03195417 0.01571305
0.100000000 0.7091432 0.8853885 0.03214610 0.01562886
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.03571429.
%%R
enetGrid <- expand.grid(lambda = c(0, 0.01, .1),
fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(x = solTrainXtrans, y = solTrainY,
method = "enet",
tuneGrid = enetGrid,
trControl = ctrl,
preProc = c("center", "scale"))
print(
enetTune
)
print(
plot(enetTune)
)
testResults$Enet <- predict(enetTune, solTestXtrans)
Elasticnet

951 samples
228 predictors

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:

  lambda  fraction  RMSE       Rsquared   RMSE SD     Rsquared SD
  0.00    0.05      0.8713747  0.8337289  0.03816148  0.02737681
  0.00    0.10      0.6882637  0.8858786  0.04298815  0.02064030
  0.00    0.15      0.6729264  0.8907993  0.03942228  0.01837582
  0.00    0.20      0.6754697  0.8903865  0.03807506  0.01760700
  0.00    0.25      0.6879252  0.8865202  0.04383623  0.01946378
  0.00    0.30      0.6971062  0.8836414  0.04812788  0.02058289
  0.00    0.35      0.7062274  0.8808469  0.05191262  0.02155822
  0.00    0.40      0.7125900  0.8788942  0.05345207  0.02192952
  0.00    0.45      0.7138742  0.8785588  0.05342746  0.02178996
  0.00    0.50      0.7141235  0.8785622  0.05461747  0.02183522
  0.00    0.55      0.7144669  0.8784961  0.05583323  0.02211744
  0.00    0.60      0.7140532  0.8786593  0.05739702  0.02234513
  0.00    0.65      0.7140599  0.8786880  0.05941448  0.02265512
  0.00    0.70      0.7145464  0.8785744  0.06116481  0.02298579
  0.00    0.75      0.7151011  0.8784348  0.06289926  0.02335653
  0.00    0.80      0.7158067  0.8782629  0.06453350  0.02366829
  0.00    0.85      0.7167918  0.8780158  0.06564865  0.02383283
  0.00    0.90      0.7178711  0.8777467  0.06672370  0.02398923
  0.00    0.95      0.7191448  0.8774055  0.06834509  0.02424302
  0.00    1.00      0.7207117  0.8769717  0.06994063  0.02450628
  0.01    0.05      1.5168857  0.6435177  0.11013983  0.07875588
  0.01    0.10      1.1324481  0.7671388  0.07499369  0.04771971
  0.01    0.15      0.9061843  0.8241043  0.05601707  0.02997353
  0.01    0.20      0.7855269  0.8571170  0.04929439  0.02173949
  0.01    0.25      0.7296380  0.8733531  0.04166558  0.02066970
  0.01    0.30      0.6989522  0.8826020  0.04255257  0.02028148
  0.01    0.35      0.6866513  0.8863490  0.04212287  0.01967040
  0.01    0.40      0.6806730  0.8884346  0.03999669  0.01852187
  0.01    0.45      0.6778780  0.8895285  0.03610764  0.01717676
  0.01    0.50      0.6760780  0.8902871  0.03307570  0.01620142
  0.01    0.55      0.6743998  0.8909724  0.03065386  0.01569024
  0.01    0.60      0.6746777  0.8910026  0.03042481  0.01580700
  0.01    0.65      0.6765522  0.8904906  0.03177438  0.01642381
  0.01    0.70      0.6796775  0.8895768  0.03364893  0.01711767
  0.01    0.75      0.6829651  0.8886182  0.03551058  0.01757998
  0.01    0.80      0.6862396  0.8876472  0.03719803  0.01791970
  0.01    0.85      0.6895735  0.8866477  0.03885651  0.01822379
  0.01    0.90      0.6930103  0.8856210  0.04047457  0.01858065
  0.01    0.95      0.6968398  0.8844630  0.04181671  0.01895729
  0.01    1.00      0.7006283  0.8833050  0.04284382  0.01929610
  0.10    0.05      1.6867967  0.5157969  0.13154407  0.08882307
  0.10    0.10      1.4058744  0.6954146  0.10735405  0.06584337
  0.10    0.15      1.1697385  0.7596795  0.08648027  0.04623881
  0.10    0.20      1.0082617  0.7880698  0.06594126  0.03758966
  0.10    0.25      0.8950440  0.8218825  0.05827006  0.02812113
  0.10    0.30      0.8193443  0.8435444  0.05167792  0.02222192
  0.10    0.35      0.7744593  0.8570276  0.04722049  0.02081488
  0.10    0.40      0.7519611  0.8644826  0.04182081  0.01957350
  0.10    0.45      0.7343282  0.8710631  0.03806132  0.01874198
  0.10    0.50      0.7245543  0.8750318  0.03539926  0.01842909
  0.10    0.55      0.7180823  0.8778937  0.03288742  0.01794844
  0.10    0.60      0.7137901  0.8799906  0.03184857  0.01756183
  0.10    0.65      0.7110967  0.8815343  0.03100037  0.01695475
  0.10    0.70      0.7104058  0.8823940  0.02973462  0.01635597
  0.10    0.75      0.7103284  0.8829674  0.02952904  0.01597719
  0.10    0.80      0.7097899  0.8836319  0.03000022  0.01578241
  0.10    0.85      0.7093246  0.8842290  0.03064013  0.01567030
  0.10    0.90      0.7094949  0.8845954  0.03109508  0.01554030
  0.10    0.95      0.7094181  0.8849823  0.03169989  0.01554197
  0.10    1.00      0.7091432  0.8853885  0.03214610  0.01562886

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.15 and lambda = 0.
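The 60 rows of resampling results above come from crossing the two tuning parameters: `expand.grid()` builds the full Cartesian product of the candidate values, and `train()` resamples every combination.

```r
enetGrid <- expand.grid(lambda = c(0, 0.01, .1),
                        fraction = seq(.05, 1, length = 20))
nrow(enetGrid)  # 3 lambda values x 20 fractions = 60 candidate models
```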
%%R
showChapterScript(7)
NULL
%%R
showChapterOutput(7)
NULL
%%R -w 600 -h 600
## runChapterScript(7)
## user system elapsed
## 112106.723 188.979 12272.168
NULL
%%R
showChapterScript(8)
NULL
%%R
showChapterOutput(8)
NULL
%%R -w 600 -h 600
## runChapterScript(8)
## user system elapsed
## 21280.849 500.609 6798.887
NULL
%%R
showChapterScript(10)
NULL
%%R
showChapterOutput(10)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 10: Case Study: Compressive Strength of Concrete Mixtures
> ###
> ### Required packages: AppliedPredictiveModeling, caret, Cubist, doMC (optional),
> ### earth, elasticnet, gbm, ipred, lattice, nnet, party, pls,
> ### randomForest, rpart, RWeka
> ###
> ### Data used: The concrete from the AppliedPredictiveModeling package
> ###
> ### Notes:
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be
> ### syntax differences that occur over time as packages evolve. These files
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
>
> ################################################################################
> ### Load the data and plot the data
>
> library(AppliedPredictiveModeling)
> data(concrete)
>
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> library(plyr)
>
> featurePlot(concrete[, -9], concrete$CompressiveStrength,
+ between = list(x = 1, y = 1),
+ type = c("g", "p", "smooth"))
>
>
> ################################################################################
> ### Section 10.1 Model Building Strategy
> ### There are replicated mixtures, so take the average per mixture
>
> averaged <- ddply(mixtures,
+ .(Cement, BlastFurnaceSlag, FlyAsh, Water,
+ Superplasticizer, CoarseAggregate,
+ FineAggregate, Age),
+ function(x) c(CompressiveStrength =
+ mean(x$CompressiveStrength)))
>
> ### Split the data and create a control object for train()
>
> set.seed(975)
> inTrain <- createDataPartition(averaged$CompressiveStrength, p = 3/4)[[1]]
> training <- averaged[ inTrain,]
> testing <- averaged[-inTrain,]
>
> ctrl <- trainControl(method = "repeatedcv", repeats = 5, number = 10)
>
> ### Create a model formula that can be used repeatedly
>
> modForm <- paste("CompressiveStrength ~ (.)^2 + I(Cement^2) + I(BlastFurnaceSlag^2) +",
+ "I(FlyAsh^2) + I(Water^2) + I(Superplasticizer^2) +",
+ "I(CoarseAggregate^2) + I(FineAggregate^2) + I(Age^2)")
> modForm <- as.formula(modForm)
>
> ### Fit the various models
>
> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
> ### up the computations.
>
> ### WARNING: Be aware of how much memory is needed to parallel
> ### process. It can very quickly overwhelm the available hardware. The
> ### estimate of the median memory usage (VSIZE = total memory size)
> ### was 2800M per core, although the M5 calculations require about
> ### 3700M without parallel processing.
>
> ### WARNING 2: The RWeka package does not work well with some forms of
> ### parallel processing, such as multicore (i.e. doMC).
>
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(14)
>
> set.seed(669)
> lmFit <- train(modForm, data = training,
+ method = "lm",
+ trControl = ctrl)
>
> set.seed(669)
> plsFit <- train(modForm, data = training,
+ method = "pls",
+ preProc = c("center", "scale"),
+ tuneLength = 15,
+ trControl = ctrl)
Loading required package: pls
Attaching package: ‘pls’
The following object is masked from ‘package:caret’:
R2
The following object is masked from ‘package:stats’:
loadings
>
> lassoGrid <- expand.grid(lambda = c(0, .001, .01, .1),
+ fraction = seq(0.05, 1, length = 20))
> set.seed(669)
> lassoFit <- train(modForm, data = training,
+ method = "enet",
+ preProc = c("center", "scale"),
+ tuneGrid = lassoGrid,
+ trControl = ctrl)
Loading required package: elasticnet
Loading required package: lars
Loaded lars 1.1
>
> set.seed(669)
> earthFit <- train(CompressiveStrength ~ ., data = training,
+ method = "earth",
+ tuneGrid = expand.grid(degree = 1,
+ nprune = 2:25),
+ trControl = ctrl)
Loading required package: earth
Loading required package: leaps
Loading required package: plotmo
Loading required package: plotrix
>
> set.seed(669)
> svmRFit <- train(CompressiveStrength ~ ., data = training,
+ method = "svmRadial",
+ tuneLength = 15,
+ preProc = c("center", "scale"),
+ trControl = ctrl)
Loading required package: kernlab
>
>
> nnetGrid <- expand.grid(decay = c(0.001, .01, .1),
+ size = seq(1, 27, by = 2),
+ bag = FALSE)
> set.seed(669)
> nnetFit <- train(CompressiveStrength ~ .,
+ data = training,
+ method = "avNNet",
+ tuneGrid = nnetGrid,
+ preProc = c("center", "scale"),
+ linout = TRUE,
+ trace = FALSE,
+ maxit = 1000,
+ allowParallel = FALSE,
+ trControl = ctrl)
Loading required package: nnet
>
> set.seed(669)
> rpartFit <- train(CompressiveStrength ~ .,
+ data = training,
+ method = "rpart",
+ tuneLength = 30,
+ trControl = ctrl)
Loading required package: rpart
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
>
> set.seed(669)
> treebagFit <- train(CompressiveStrength ~ .,
+ data = training,
+ method = "treebag",
+ trControl = ctrl)
Loading required package: ipred
Loading required package: MASS
Loading required package: survival
Loading required package: splines
Attaching package: ‘survival’
The following object is masked from ‘package:caret’:
cluster
Loading required package: class
Loading required package: prodlim
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
>
> set.seed(669)
> ctreeFit <- train(CompressiveStrength ~ .,
+ data = training,
+ method = "ctree",
+ tuneLength = 10,
+ trControl = ctrl)
Loading required package: party
Loading required package: grid
Loading required package: modeltools
Loading required package: stats4
Attaching package: ‘modeltools’
The following object is masked from ‘package:kernlab’:
prior
The following object is masked from ‘package:plyr’:
empty
Loading required package: coin
Loading required package: mvtnorm
Loading required package: zoo
Attaching package: ‘zoo’
The following object is masked from ‘package:base’:
as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: strucchange
Loading required package: vcd
Loading required package: colorspace
>
> set.seed(669)
> rfFit <- train(CompressiveStrength ~ .,
+ data = training,
+ method = "rf",
+ tuneLength = 10,
+ ntrees = 1000,
+ importance = TRUE,
+ trControl = ctrl)
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
>
>
> gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
+ n.trees = seq(100, 1000, by = 50),
+ shrinkage = c(0.01, 0.1))
> set.seed(669)
> gbmFit <- train(CompressiveStrength ~ .,
+ data = training,
+ method = "gbm",
+ tuneGrid = gbmGrid,
+ verbose = FALSE,
+ trControl = ctrl)
Loading required package: gbm
Loaded gbm 2.1
>
>
> cbGrid <- expand.grid(committees = c(1, 5, 10, 50, 75, 100),
+ neighbors = c(0, 1, 3, 5, 7, 9))
> set.seed(669)
> cbFit <- train(CompressiveStrength ~ .,
+ data = training,
+ method = "cubist",
+ tuneGrid = cbGrid,
+ trControl = ctrl)
Loading required package: Cubist
Loading required package: reshape2
>
> ### Turn off the parallel processing to use RWeka.
> registerDoSEQ()
>
>
> set.seed(669)
> mtFit <- train(CompressiveStrength ~ .,
+ data = training,
+ method = "M5",
+ trControl = ctrl)
Loading required package: RWeka
Warning message:
In train.default(x, y, weights = w, ...) :
Models using Weka will not work with parallel processing with multicore/doMC
>
> ################################################################################
> ### Section 10.2 Model Performance
>
> ### Collect the resampling statistics across all the models
>
> rs <- resamples(list("Linear Reg" = lmFit,
+                      "PLS" = plsFit,
+ "Elastic Net" = lassoFit,
+ MARS = earthFit,
+ SVM = svmRFit,
+ "Neural Networks" = nnetFit,
+ CART = rpartFit,
+ "Cond Inf Tree" = ctreeFit,
+ "Bagged Tree" = treebagFit,
+ "Boosted Tree" = gbmFit,
+ "Random Forest" = rfFit,
+ Cubist = cbFit))
>
> #parallelPlot(rs)
> #parallelPlot(rs, metric = "Rsquared")
>
> ### Get the test set results across several models
>
> nnetPred <- predict(nnetFit, testing)
> gbmPred <- predict(gbmFit, testing)
> cbPred <- predict(cbFit, testing)
>
> testResults <- rbind(postResample(nnetPred, testing$CompressiveStrength),
+ postResample(gbmPred, testing$CompressiveStrength),
+ postResample(cbPred, testing$CompressiveStrength))
> testResults <- as.data.frame(testResults)
> testResults$Model <- c("Neural Networks", "Boosted Tree", "Cubist")
> testResults <- testResults[order(testResults$RMSE),]
>
> ################################################################################
> ### Section 10.3 Optimizing Compressive Strength
>
> library(proxy)
Attaching package: ‘proxy’
The following object is masked from ‘package:stats’:
as.dist, dist
>
> ### Create a function to maximize compressive strength* while keeping
> ### the predictor values as mixtures. Water (in x[7]) is used as the
> ### 'slack variable'.
>
> ### * We are actually minimizing the negative compressive strength
>
> modelPrediction <- function(x, mod, limit = 2500)
+ {
+ if(x[1] < 0 | x[1] > 1) return(10^38)
+ if(x[2] < 0 | x[2] > 1) return(10^38)
+ if(x[3] < 0 | x[3] > 1) return(10^38)
+ if(x[4] < 0 | x[4] > 1) return(10^38)
+ if(x[5] < 0 | x[5] > 1) return(10^38)
+ if(x[6] < 0 | x[6] > 1) return(10^38)
+
+ x <- c(x, 1 - sum(x))
+
+ if(x[7] < 0.05) return(10^38)
+
+ tmp <- as.data.frame(t(x))
+ names(tmp) <- c('Cement','BlastFurnaceSlag','FlyAsh',
+ 'Superplasticizer','CoarseAggregate',
+ 'FineAggregate', 'Water')
+ tmp$Age <- 28
+ -predict(mod, tmp)
+ }
>
> ### Get mixtures at 28 days
> subTrain <- subset(training, Age == 28)
>
> ### Center and scale the data to use dissimilarity sampling
> pp1 <- preProcess(subTrain[, -(8:9)], c("center", "scale"))
> scaledTrain <- predict(pp1, subTrain[, 1:7])
>
> ### Randomly select a few mixtures as a starting pool
>
> set.seed(91)
> startMixture <- sample(1:nrow(subTrain), 1)
> starters <- scaledTrain[startMixture, 1:7]
> pool <- scaledTrain
> index <- maxDissim(starters, pool, 14)
> startPoints <- c(startMixture, index)
>
> starters <- subTrain[startPoints,1:7]
> startingValues <- starters[, -4]
>
> ### For each starting mixture, optimize the Cubist model using
> ### a simplex search routine
>
> cbResults <- startingValues
> cbResults$Water <- NA
> cbResults$Prediction <- NA
>
> for(i in 1:nrow(cbResults))
+ {
+ results <- optim(unlist(cbResults[i,1:6]),
+ modelPrediction,
+ method = "Nelder-Mead",
+ control=list(maxit=5000),
+ mod = cbFit)
+ cbResults$Prediction[i] <- -results$value
+ cbResults[i,1:6] <- results$par
+ }
> cbResults$Water <- 1 - apply(cbResults[,1:6], 1, sum)
> cbResults <- subset(cbResults, Prediction > 0 & Water > .02)
> cbResults <- cbResults[order(-cbResults$Prediction),][1:3,]
> cbResults$Model <- "Cubist"
>
> ### Do the same for the neural network model
>
> nnetResults <- startingValues
> nnetResults$Water <- NA
> nnetResults$Prediction <- NA
>
> for(i in 1:nrow(nnetResults))
+ {
+ results <- optim(unlist(nnetResults[i, 1:6,]),
+ modelPrediction,
+ method = "Nelder-Mead",
+ control=list(maxit=5000),
+ mod = nnetFit)
+ nnetResults$Prediction[i] <- -results$value
+ nnetResults[i,1:6] <- results$par
+ }
> nnetResults$Water <- 1 - apply(nnetResults[,1:6], 1, sum)
> nnetResults <- subset(nnetResults, Prediction > 0 & Water > .02)
> nnetResults <- nnetResults[order(-nnetResults$Prediction),][1:3,]
> nnetResults$Model <- "NNet"
>
> ### Convert the predicted mixtures to PCA space and plot
>
> pp2 <- preProcess(subTrain[, 1:7], "pca")
> pca1 <- predict(pp2, subTrain[, 1:7])
> pca1$Data <- "Training Set"
> pca1$Data[startPoints] <- "Starting Values"
> pca3 <- predict(pp2, cbResults[, names(subTrain[, 1:7])])
> pca3$Data <- "Cubist"
> pca4 <- predict(pp2, nnetResults[, names(subTrain[, 1:7])])
> pca4$Data <- "Neural Network"
>
> pcaData <- rbind(pca1, pca3, pca4)
> pcaData$Data <- factor(pcaData$Data,
+ levels = c("Training Set","Starting Values",
+ "Cubist","Neural Network"))
>
> lim <- extendrange(pcaData[, 1:2])
>
> xyplot(PC2 ~ PC1,
+ data = pcaData,
+ groups = Data,
+ auto.key = list(columns = 2),
+ xlim = lim,
+ ylim = lim,
+ type = c("g", "p"))
>
>
> ################################################################################
> ### Session Information
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 grid splines parallel stats graphics grDevices
[8] utils datasets methods base
other attached packages:
[1] proxy_0.4-9 RWeka_0.4-15
[3] Cubist_0.0.13 reshape2_1.2.2
[5] gbm_2.1 randomForest_4.6-7
[7] party_1.0-6 vcd_1.2-13
[9] colorspace_1.2-1 strucchange_1.4-7
[11] sandwich_2.2-9 zoo_1.7-9
[13] coin_1.0-21 mvtnorm_0.9-9994
[15] modeltools_0.2-19 ipred_0.9-1
[17] prodlim_1.3.3 class_7.3-7
[19] survival_2.37-4 MASS_7.3-26
[21] rpart_4.1-1 nnet_7.3-6
[23] kernlab_0.9-16 earth_3.2-3
[25] plotrix_3.4-6 plotmo_1.3-2
[27] leaps_2.9 elasticnet_1.1
[29] lars_1.1 pls_2.3-0
[31] doMC_1.3.0 iterators_1.0.6
[33] foreach_1.4.0 plyr_1.8
[35] caret_6.0-22 ggplot2_0.9.3.1
[37] lattice_0.20-15 AppliedPredictiveModeling_1.1-5
loaded via a namespace (and not attached):
[1] car_2.0-16 codetools_0.2-8 compiler_3.0.1 CORElearn_0.9.41
[5] dichromat_2.0-0 digest_0.6.3 gtable_0.1.2 KernSmooth_2.23-10
[9] labeling_0.1 munsell_0.4 proto_0.3-10 RColorBrewer_1.0-5
[13] rJava_0.9-4 RWekajars_3.7.8-1 scales_0.2.3 stringr_0.6.2
[17] tools_3.0.1
>
> q("no")
> proc.time()
user system elapsed
20277.196 121.470 4043.395
%%R
# Try this if you are very patient --
# in the APM version of the output file,
# the run time for this script is listed as 5.6 hours.
# Chs 10 and 17 evaluate many different models in case studies.
# To run the Ch.10 script:
VERY_PATIENT = FALSE
if (VERY_PATIENT) {
current_working_directory = getwd() # remember current directory
chapter_code_directory = scriptLocation()
setwd( chapter_code_directory )
print(dir())
print(source("10_Case_Study_Concrete.R", echo=TRUE))
setwd(current_working_directory) # return to working directory
}
## user system elapsed
## 20277.196 121.470 4043.395
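The mixture search in the Chapter 10 script works by handing `optim()` a wrapper (`modelPrediction()`) that returns a huge value (10^38) for infeasible mixtures and the negated model prediction otherwise, so Nelder-Mead maximizes inside the feasible region. A minimal sketch of the same pattern on a hypothetical toy objective (maximize x1 * x2 * slack over the simplex, with the slack component playing the role of Water):

```r
## Toy stand-in for modelPrediction(): penalize infeasible points, negate to
## maximize with optim(), and treat the remaining proportion as slack.
toyObjective <- function(x) {
  if (any(x < 0) || any(x > 1)) return(1e38)   # outside the unit box
  slack <- 1 - sum(x)                          # the 'slack' ingredient
  if (slack < 0.05) return(1e38)               # keep some slack in the mixture
  -(x[1] * x[2] * slack)                       # negate: optim() minimizes
}

res <- optim(c(0.3, 0.3), toyObjective, method = "Nelder-Mead",
             control = list(maxit = 5000))
-res$value   # best (maximized) objective; the optimum is 1/27 at x1 = x2 = 1/3
```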
%%R
showChapterScript(11)
NULL
%%R
showChapterOutput(11)
NULL
%%R -w 600 -h 600
runChapterScript(11)
## user system elapsed
## 11.120 0.526 11.698
NULL
%%R
### Section 11.1 Class Predictions
library(AppliedPredictiveModeling)
### Simulate some two class data with two predictors
set.seed(975)
training <- quadBoundaryFunc(500)
testing <- quadBoundaryFunc(1000)
testing$class2 <- ifelse(testing$class == "Class1", 1, 0)
testing$ID <- 1:nrow(testing)
### Fit models
library(MASS)
qdaFit <- qda(class ~ X1 + X2, data = training)
library(randomForest)
rfFit <- randomForest(class ~ X1 + X2, data = training, ntree = 2000)
### Predict the test set
testing$qda <- predict(qdaFit, testing)$posterior[,1]
testing$rf <- predict(rfFit, testing, type = "prob")[,1]
### Generate the calibration analysis
library(caret)
calData1 <- calibration(class ~ qda + rf, data = testing, cuts = 10)
### Plot the curve
print(
xyplot(calData1, auto.key = list(columns = 2))
)
randomForest 4.6-10
Type rfNews() to see new features/changes/bug fixes.
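The `calibration()` call above bins the predicted class probabilities and compares each bin against the observed event rate. A minimal base-R sketch of that computation on simulated (hypothetical) probabilities:

```r
set.seed(1)
prob <- runif(200)                         # hypothetical predicted probabilities
obs  <- rbinom(200, 1, prob)               # outcomes drawn from those probabilities
bins <- cut(prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

calib <- data.frame(midpoint  = tapply(prob, bins, mean),
                    eventRate = tapply(obs,  bins, mean))
## For a well-calibrated model the two columns track each other closely
calib
```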
%%R
### To calibrate the data, treat the probabilities as inputs into the
### model
trainProbs <- training
trainProbs$qda <- predict(qdaFit)$posterior[,1]
### These models take the probabilities as inputs and, based on the
### true class, re-calibrate them.
library(klaR)
nbCal <- NaiveBayes(class ~ qda, data = trainProbs, usekernel = TRUE)
### We use relevel() here because glm() models the probability of the
### second factor level.
lrCal <- glm(relevel(class, "Class2") ~ qda, data = trainProbs, family = binomial)
### Now re-predict the test set using the modified class probability
### estimates
testing$qda2 <- predict(nbCal, testing[, "qda", drop = FALSE])$posterior[,1]
testing$qda3 <- predict(lrCal, testing[, "qda", drop = FALSE], type = "response")
### Manipulate the data a bit for pretty plotting
simulatedProbs <- testing[, c("class", "rf", "qda3")]
names(simulatedProbs) <- c("TrueClass", "RandomForestProb", "QDACalibrated")
simulatedProbs$RandomForestClass <- predict(rfFit, testing)
calData2 <- calibration(class ~ qda + qda2 + qda3, data = testing)
calData2$data$calibModelVar <- as.character(calData2$data$calibModelVar)
calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda",
"QDA",
calData2$data$calibModelVar)
calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda2",
"Bayesian Calibration",
calData2$data$calibModelVar)
calData2$data$calibModelVar <- ifelse(calData2$data$calibModelVar == "qda3",
"Sigmoidal Calibration",
calData2$data$calibModelVar)
calData2$data$calibModelVar <- factor(calData2$data$calibModelVar,
levels = c("QDA",
"Bayesian Calibration",
"Sigmoidal Calibration"))
print(
xyplot(calData2, auto.key = list(columns = 1))
)
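The `glm()` fit above is the sigmoidal (Platt-style) recalibration: the raw probability becomes the single input to a logistic regression on the true class. A self-contained sketch on simulated data (all names hypothetical), where the raw scores are systematically inflated and recalibration lowers the Brier score:

```r
set.seed(2)
p   <- runif(2000)                  # true event probabilities
raw <- sqrt(p)                      # systematically inflated raw scores
y   <- rbinom(2000, 1, p)           # observed outcomes

## Sigmoidal recalibration: logistic regression of the outcome on the raw score
cal <- glm(y ~ raw, family = binomial)
adj <- predict(cal, type = "response")

## Brier scores (mean squared error of the probabilities): lower is better
c(raw = mean((raw - y)^2), calibrated = mean((adj - y)^2))
```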
%%R
## These commands are needed to reload GermanCredit, which is changed by this and Ch.4 code:
detach(package:caret)
library(caret)
data(GermanCredit)
## First, remove near-zero variance predictors, then get rid of a few predictors
## that duplicate values. For example, there are three possible values for the
## housing variable: "Rent", "Own" and "ForFree". So that we don't have linear
## dependencies, we get rid of one of the levels (e.g. "ForFree")
GermanCredit <- GermanCredit[, -nearZeroVar(GermanCredit)]
GermanCredit$CheckingAccountStatus.lt.0 <- NULL
GermanCredit$SavingsAccountBonds.lt.100 <- NULL
GermanCredit$EmploymentDuration.lt.1 <- NULL
GermanCredit$EmploymentDuration.Unemployed <- NULL
GermanCredit$Personal.Male.Married.Widowed <- NULL
GermanCredit$Property.Unknown <- NULL
GermanCredit$Housing.ForFree <- NULL
## Split the data into training (80%) and test sets (20%)
set.seed(100)
inTrain <- createDataPartition(GermanCredit$Class, p = .8)[[1]]
GermanCreditTrain <- GermanCredit[ inTrain, ]
GermanCreditTest <- GermanCredit[-inTrain, ]
set.seed(1056)
logisticReg <- train(Class ~ .,
data = GermanCreditTrain,
method = "glm",
trControl = trainControl(method = "repeatedcv",
repeats = 5))
print(
logisticReg
)
### Predict the test set
creditResults <- data.frame(obs = GermanCreditTest$Class)
creditResults$prob <- predict(logisticReg, GermanCreditTest, type = "prob")[, "Bad"]
creditResults$pred <- predict(logisticReg, GermanCreditTest)
creditResults$Label <- ifelse(creditResults$obs == "Bad",
"True Outcome: Bad Credit",
"True Outcome: Good Credit")
### Plot the probability of bad credit
print(
histogram(~prob|Label,
data = creditResults,
layout = c(2, 1),
nint = 20,
xlab = "Probability of Bad Credit",
type = "count")
)
### Calculate and plot the calibration curve
creditCalib <- calibration(obs ~ prob, data = creditResults)
print(
xyplot(creditCalib)
)
### Create the confusion matrix from the test set.
print(
confusionMatrix(data = creditResults$pred,
reference = creditResults$obs)
)
Generalized Linear Model
800 samples
41 predictor
2 classes: 'Bad', 'Good'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.749 0.3647664 0.05162166 0.1218109
Confusion Matrix and Statistics
Reference
Prediction Bad Good
Bad 24 10
Good 36 130
Accuracy : 0.77
95% CI : (0.7054, 0.8264)
No Information Rate : 0.7
P-Value [Acc > NIR] : 0.0168694
Kappa : 0.375
Mcnemar's Test P-Value : 0.0002278
Sensitivity : 0.4000
Specificity : 0.9286
Pos Pred Value : 0.7059
Neg Pred Value : 0.7831
Prevalence : 0.3000
Detection Rate : 0.1200
Detection Prevalence : 0.1700
Balanced Accuracy : 0.6643
'Positive' Class : Bad
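The statistics printed above follow directly from the 2x2 table. Using the cell counts from this output (with "Bad" as the positive class):

```r
## Cells from the confusion matrix above: rows = predicted, columns = observed
TP <- 24   # predicted Bad,  observed Bad
FP <- 10   # predicted Bad,  observed Good
FN <- 36   # predicted Good, observed Bad
TN <- 130  # predicted Good, observed Good

sensitivity <- TP / (TP + FN)                    # 24/60   = 0.400
specificity <- TN / (TN + FP)                    # 130/140 = 0.9286
accuracy    <- (TP + TN) / (TP + FP + FN + TN)   # 154/200 = 0.77
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```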
%%R
### ROC curves:
### Like glm(), roc() treats the last level of the factor as the event
### of interest so we use relevel() to change the observed class data
library(pROC)
creditROC <- roc(relevel(creditResults$obs, "Good"), creditResults$prob)
coords(creditROC, "all")[,1:3]
print(
auc(creditROC)
)
print(
ci.auc(creditROC)
)
### Note the x-axis is reversed
plot(creditROC)
### Old-school:
plot(creditROC, legacy.axes = TRUE)
### Lift charts
creditLift <- lift(obs ~ prob, data = creditResults)
print(
xyplot(creditLift)
)
Area under the curve: 0.775
95% CI: 0.7032-0.8468 (DeLong)
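The area under the ROC curve has a useful probabilistic reading: it is the probability that a randomly chosen event receives a higher score than a randomly chosen non-event. A minimal sketch on simulated (hypothetical) scores:

```r
set.seed(3)
score_event    <- rnorm(100, mean = 1)   # scores for the events
score_nonevent <- rnorm(100, mean = 0)   # scores for the non-events

## AUC as the proportion of event/non-event pairs that are ranked correctly
auc <- mean(outer(score_event, score_nonevent, ">"))
auc
```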
%%R
summary(GermanCredit)
Duration Amount InstallmentRatePercentage ResidenceDuration
Min. : 4.0 Min. : 250 Min. :1.000 Min. :1.000
1st Qu.:12.0 1st Qu.: 1366 1st Qu.:2.000 1st Qu.:2.000
Median :18.0 Median : 2320 Median :3.000 Median :3.000
Mean :20.9 Mean : 3271 Mean :2.973 Mean :2.845
3rd Qu.:24.0 3rd Qu.: 3972 3rd Qu.:4.000 3rd Qu.:4.000
Max. :72.0 Max. :18424 Max. :4.000 Max. :4.000
Age NumberExistingCredits NumberPeopleMaintenance Telephone
Min. :19.00 Min. :1.000 Min. :1.000 Min. :0.000
1st Qu.:27.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
Median :33.00 Median :1.000 Median :1.000 Median :1.000
Mean :35.55 Mean :1.407 Mean :1.155 Mean :0.596
3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
Max. :75.00 Max. :4.000 Max. :2.000 Max. :1.000
Class CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
Bad :300 Min. :0.000 Min. :0.000
Good:700 1st Qu.:0.000 1st Qu.:0.000
Median :0.000 Median :0.000
Mean :0.269 Mean :0.063
3rd Qu.:1.000 3rd Qu.:0.000
Max. :1.000 Max. :1.000
CheckingAccountStatus.none CreditHistory.PaidDuly CreditHistory.Delay
Min. :0.000 Min. :0.00 Min. :0.000
1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
Median :0.000 Median :1.00 Median :0.000
Mean :0.394 Mean :0.53 Mean :0.088
3rd Qu.:1.000 3rd Qu.:1.00 3rd Qu.:0.000
Max. :1.000 Max. :1.00 Max. :1.000
CreditHistory.Critical Purpose.NewCar Purpose.UsedCar
Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
Median :0.000 Median :0.000 Median :0.000
Mean :0.293 Mean :0.234 Mean :0.103
3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
Max. :1.000 Max. :1.000 Max. :1.000
Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.Education
Min. :0.000 Min. :0.00 Min. :0.00
1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.00
Median :0.000 Median :0.00 Median :0.00
Mean :0.181 Mean :0.28 Mean :0.05
3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:0.00
Max. :1.000 Max. :1.00 Max. :1.00
Purpose.Business SavingsAccountBonds.100.to.500
Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.000
Median :0.000 Median :0.000
Mean :0.097 Mean :0.103
3rd Qu.:0.000 3rd Qu.:0.000
Max. :1.000 Max. :1.000
SavingsAccountBonds.500.to.1000 SavingsAccountBonds.Unknown
Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.000
Median :0.000 Median :0.000
Mean :0.063 Mean :0.183
3rd Qu.:0.000 3rd Qu.:0.000
Max. :1.000 Max. :1.000
EmploymentDuration.1.to.4 EmploymentDuration.4.to.7 EmploymentDuration.gt.7
Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
Median :0.000 Median :0.000 Median :0.000
Mean :0.339 Mean :0.174 Mean :0.253
3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:1.000
Max. :1.000 Max. :1.000 Max. :1.000
Personal.Male.Divorced.Seperated Personal.Female.NotSingle
Min. :0.00 Min. :0.00
1st Qu.:0.00 1st Qu.:0.00
Median :0.00 Median :0.00
Mean :0.05 Mean :0.31
3rd Qu.:0.00 3rd Qu.:1.00
Max. :1.00 Max. :1.00
Personal.Male.Single OtherDebtorsGuarantors.None
Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:1.000
Median :1.000 Median :1.000
Mean :0.548 Mean :0.907
3rd Qu.:1.000 3rd Qu.:1.000
Max. :1.000 Max. :1.000
OtherDebtorsGuarantors.Guarantor Property.RealEstate Property.Insurance
Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
Median :0.000 Median :0.000 Median :0.000
Mean :0.052 Mean :0.282 Mean :0.232
3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
Max. :1.000 Max. :1.000 Max. :1.000
Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.None
Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000
Median :0.000 Median :0.000 Median :1.000
Mean :0.332 Mean :0.139 Mean :0.814
3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:1.000
Max. :1.000 Max. :1.000 Max. :1.000
Housing.Rent Housing.Own Job.UnskilledResident Job.SkilledEmployee
Min. :0.000 Min. :0.000 Min. :0.0 Min. :0.00
1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.00
Median :0.000 Median :1.000 Median :0.0 Median :1.00
Mean :0.179 Mean :0.713 Mean :0.2 Mean :0.63
3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.0 3rd Qu.:1.00
Max. :1.000 Max. :1.000 Max. :1.0 Max. :1.00
Job.Management.SelfEmp.HighlyQualified
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.148
3rd Qu.:0.000
Max. :1.000
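The GermanCredit preprocessing above begins by dropping near-zero-variance predictors with `nearZeroVar()`. Its two screening rules can be sketched in base R (using caret's documented defaults, `freqCut = 95/5` and `uniqueCut = 10`; this is an illustration, not the caret source):

```r
## Flag a predictor when the most common value heavily dominates the second
## most common AND the percentage of unique values is small.
nzvManual <- function(x, freqCut = 95/5, uniqueCut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  freqRatio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
  pctUnique <- 100 * length(tab) / length(x)
  freqRatio > freqCut && pctUnique < uniqueCut
}

nzvManual(c(rep(0, 99), 1))   # TRUE: 99:1 frequency ratio, 2% unique values
nzvManual(1:100)              # FALSE: every value unique
```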
%%R
showChapterScript(12)
NULL
%%R
showChapterOutput(12)
R Information
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 12 Discriminant Analysis and Other Linear Classification Models
> ###
> ### Required packages: AppliedPredictiveModeling, caret, doMC (optional),
> ### glmnet, lattice, MASS, pamr, pls, pROC, sparseLDA
> ###
> ### Data used: The grant application data. See the file 'CreateGrantData.R'
> ###
> ### Notes:
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be
> ### syntax differences that occur over time as packages evolve. These files
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
>
> ################################################################################
> ### Section 12.1 Case Study: Predicting Successful Grant Applications
>
> load("grantData.RData")
>
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(12)
> library(plyr)
> library(reshape2)
>
> ## Look at two different ways to split and resample the data. A support vector
> ## machine is used to illustrate the differences. The full set of predictors
> ## is used.
>
> pre2008Data <- training[pre2008,]
> year2008Data <- rbind(training[-pre2008,], testing)
>
> set.seed(552)
> test2008 <- createDataPartition(year2008Data$Class, p = .25)[[1]]
>
> allData <- rbind(pre2008Data, year2008Data[-test2008,])
> holdout2008 <- year2008Data[test2008,]
>
> ## Use a common tuning grid for both approaches.
> svmrGrid <- expand.grid(sigma = c(.00007, .00009, .0001, .0002),
+ C = 2^(-3:8))
>
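A quick aside on `expand.grid`: it crosses every value of `sigma` with every value of `C`, so the grid above contains 4 × 12 = 48 candidate (sigma, C) pairs, each of which `train()` will fit and resample:

```r
# Full factorial tuning grid: 4 sigma values crossed with 12 cost values
svmrGrid <- expand.grid(sigma = c(.00007, .00009, .0001, .0002),
                        C = 2^(-3:8))
nrow(svmrGrid)  # 48 candidate models
```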
> ## Evaluate the model using overall 10-fold cross-validation
> ctrl0 <- trainControl(method = "cv",
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE)
> set.seed(477)
> svmFit0 <- train(pre2008Data[,fullSet], pre2008Data$Class,
+ method = "svmRadial",
+ tuneGrid = svmrGrid,
+ preProc = c("center", "scale"),
+ metric = "ROC",
+ trControl = ctrl0)
Loading required package: kernlab
Loading required package: pROC
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following objects are masked from 'package:stats':
cov, smooth, var
> svmFit0
Support Vector Machines with Radial Basis Function Kernel
6633 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 5970, 5970, 5969, 5970, 5970, 5969, ...
Resampling results across tuning parameters:
sigma C ROC Sens Spec ROC SD Sens SD Spec SD
7e-05 0.125 0.806 0.88 0.562 0.0231 0.023 0.0168
7e-05 0.25 0.81 0.876 0.574 0.022 0.0254 0.0157
7e-05 0.5 0.836 0.837 0.677 0.018 0.029 0.0194
7e-05 1 0.853 0.803 0.757 0.0173 0.0308 0.0288
7e-05 2 0.863 0.805 0.78 0.0177 0.0275 0.0318
7e-05 4 0.869 0.8 0.789 0.0168 0.0279 0.0285
7e-05 8 0.874 0.798 0.798 0.0189 0.0313 0.0279
7e-05 16 0.876 0.796 0.797 0.0193 0.03 0.0235
7e-05 32 0.877 0.793 0.801 0.0184 0.0242 0.0287
7e-05 64 0.877 0.793 0.81 0.0178 0.034 0.0182
7e-05 128 0.876 0.793 0.812 0.0163 0.0233 0.0164
7e-05 256 0.873 0.794 0.812 0.0165 0.0239 0.0162
9e-05 0.125 0.8 0.876 0.551 0.0249 0.0209 0.023
9e-05 0.25 0.811 0.87 0.581 0.0219 0.0236 0.0186
9e-05 0.5 0.842 0.816 0.715 0.018 0.031 0.0258
9e-05 1 0.856 0.8 0.769 0.0176 0.0314 0.0306
9e-05 2 0.866 0.801 0.785 0.0173 0.0277 0.0315
9e-05 4 0.871 0.8 0.792 0.0172 0.0271 0.0269
9e-05 8 0.875 0.796 0.796 0.0188 0.0295 0.0259
9e-05 16 0.877 0.795 0.8 0.0186 0.0258 0.0246
9e-05 32 0.878 0.793 0.804 0.0179 0.0291 0.025
9e-05 64 0.877 0.794 0.813 0.0169 0.0297 0.0187
9e-05 128 0.876 0.795 0.813 0.0156 0.0228 0.0153
9e-05 256 0.874 0.788 0.814 0.0164 0.0205 0.017
1e-04 0.125 0.797 0.878 0.546 0.0257 0.0241 0.016
1e-04 0.25 0.814 0.863 0.596 0.0212 0.0319 0.0189
1e-04 0.5 0.845 0.81 0.728 0.018 0.0296 0.0247
1e-04 1 0.857 0.799 0.771 0.0179 0.0321 0.0298
1e-04 2 0.867 0.804 0.785 0.0173 0.0285 0.0312
1e-04 4 0.872 0.801 0.794 0.0174 0.0279 0.0266
1e-04 8 0.875 0.792 0.797 0.0187 0.0304 0.0242
1e-04 16 0.878 0.794 0.799 0.0184 0.0249 0.025
1e-04 32 0.878 0.795 0.806 0.0179 0.0335 0.0222
1e-04 64 0.878 0.796 0.812 0.0163 0.0245 0.0168
1e-04 128 0.876 0.796 0.811 0.0159 0.0215 0.0143
1e-04 256 0.874 0.788 0.816 0.0165 0.0209 0.0127
2e-04 0.125 0.786 0.861 0.542 0.0282 0.0356 0.0198
2e-04 0.25 0.836 0.81 0.701 0.0192 0.0382 0.0232
2e-04 0.5 0.853 0.792 0.765 0.0177 0.0342 0.0308
2e-04 1 0.864 0.8 0.782 0.0177 0.028 0.036
2e-04 2 0.87 0.796 0.789 0.0174 0.0258 0.0277
2e-04 4 0.875 0.795 0.793 0.0182 0.0295 0.026
2e-04 8 0.878 0.793 0.801 0.0176 0.0293 0.0196
2e-04 16 0.879 0.796 0.809 0.0167 0.033 0.0203
2e-04 32 0.88 0.795 0.811 0.0153 0.0227 0.0169
2e-04 64 0.879 0.792 0.813 0.0155 0.0194 0.0171
2e-04 128 0.877 0.786 0.816 0.0162 0.0235 0.0128
2e-04 256 0.877 0.789 0.822 0.0156 0.0241 0.0159
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 2e-04 and C = 32.
>
> ### Now fit the single 2008 test set
> ctrl00 <- trainControl(method = "LGOCV",
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE,
+ index = list(TestSet = 1:nrow(pre2008Data)))
>
>
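The `index` argument is the key to the "single 2008 test set" scheme: passing one resample whose indices are `1:nrow(pre2008Data)` tells `train()` to fit each candidate model on the pre-2008 rows of `allData` and evaluate it exactly once on everything else. A minimal sketch of the same idiom on simulated data (`twoClassSim()` is caret's built-in simulator; the 80/20 split here is arbitrary):

```r
library(caret)

set.seed(1)
d <- twoClassSim(100)  # simulated two-class data from caret

# One resample: rows 1:80 are the training indices, rows 81:100 the hold-out
ctrlOne <- trainControl(method = "LGOCV",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = list(TrainSet = 1:80))

fit <- train(Class ~ ., data = d,
             method = "glm", metric = "ROC",
             trControl = ctrlOne)
fit$resample  # a single ROC/Sens/Spec row, computed on rows 81:100
```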
> set.seed(476)
> svmFit00 <- train(allData[,fullSet], allData$Class,
+ method = "svmRadial",
+ tuneGrid = svmrGrid,
+ preProc = c("center", "scale"),
+ metric = "ROC",
+ trControl = ctrl00)
> svmFit00
Support Vector Machines with Radial Basis Function Kernel
8189 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
sigma C ROC Sens Spec
7e-05 0.125 0.806 0.968 0.494
7e-05 0.25 0.814 0.965 0.512
7e-05 0.5 0.855 0.921 0.651
7e-05 1 0.873 0.882 0.753
7e-05 2 0.882 0.873 0.783
7e-05 4 0.886 0.856 0.802
7e-05 8 0.887 0.835 0.813
7e-05 16 0.883 0.812 0.813
7e-05 32 0.875 0.786 0.814
7e-05 64 0.872 0.794 0.816
7e-05 128 0.872 0.791 0.807
7e-05 256 0.869 0.793 0.811
9e-05 0.125 0.798 0.97 0.478
9e-05 0.25 0.819 0.96 0.536
9e-05 0.5 0.864 0.902 0.688
9e-05 1 0.876 0.868 0.765
9e-05 2 0.885 0.863 0.785
9e-05 4 0.888 0.84 0.807
9e-05 8 0.887 0.822 0.806
9e-05 16 0.88 0.801 0.816
9e-05 32 0.874 0.791 0.821
9e-05 64 0.873 0.8 0.811
9e-05 128 0.872 0.791 0.812
9e-05 256 0.865 0.775 0.803
1e-04 0.125 0.795 0.961 0.476
1e-04 0.25 0.825 0.946 0.563
1e-04 0.5 0.867 0.895 0.709
1e-04 1 0.877 0.87 0.765
1e-04 2 0.885 0.858 0.786
1e-04 4 0.888 0.835 0.804
1e-04 8 0.887 0.819 0.809
1e-04 16 0.88 0.791 0.818
1e-04 32 0.875 0.798 0.814
1e-04 64 0.874 0.796 0.809
1e-04 128 0.871 0.794 0.807
1e-04 256 0.863 0.78 0.79
2e-04 0.125 0.791 0.942 0.504
2e-04 0.25 0.86 0.888 0.684
2e-04 0.5 0.875 0.865 0.752
2e-04 1 0.884 0.849 0.783
2e-04 2 0.886 0.833 0.798
2e-04 4 0.888 0.821 0.803
2e-04 8 0.883 0.805 0.814
2e-04 16 0.88 0.803 0.817
2e-04 32 0.877 0.796 0.816
2e-04 64 0.868 0.78 0.805
2e-04 128 0.862 0.779 0.791
2e-04 256 0.857 0.78 0.779
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 2e-04 and C = 4.
>
> ## Combine the two sets of results and plot
>
> grid0 <- subset(svmFit0$results, sigma == svmFit0$bestTune$sigma)
> grid0$Model <- "10-Fold Cross-Validation"
>
> grid00 <- subset(svmFit00$results, sigma == svmFit00$bestTune$sigma)
> grid00$Model <- "Single 2008 Test Set"
>
> plotData <- rbind(grid00, grid0)
>
> plotData <- plotData[!is.na(plotData$ROC),]
> xyplot(ROC ~ C, data = plotData,
+ groups = Model,
+ type = c("g", "o"),
+ scales = list(x = list(log = 2)),
+ auto.key = list(columns = 1))
>
> ################################################################################
> ### Section 12.2 Logistic Regression
>
> modelFit <- glm(Class ~ Day, data = training[pre2008,], family = binomial)
> dataGrid <- data.frame(Day = seq(0, 365, length = 500))
> dataGrid$Linear <- 1 - predict(modelFit, dataGrid, type = "response")
> linear2008 <- auc(roc(response = training[-pre2008, "Class"],
+ predictor = 1 - predict(modelFit,
+ training[-pre2008,],
+ type = "response"),
+ levels = rev(levels(training[-pre2008, "Class"]))))
>
>
> modelFit2 <- glm(Class ~ Day + I(Day^2),
+ data = training[pre2008,],
+ family = binomial)
> dataGrid$Quadratic <- 1 - predict(modelFit2, dataGrid, type = "response")
> quad2008 <- auc(roc(response = training[-pre2008, "Class"],
+ predictor = 1 - predict(modelFit2,
+ training[-pre2008,],
+ type = "response"),
+ levels = rev(levels(training[-pre2008, "Class"]))))
>
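Note the `levels = rev(levels(...))` argument in each `roc()` call: pROC expects the control level first and the case level second, while the `Class` factor is ordered ('successful', 'unsuccessful'). Reversing the levels declares 'unsuccessful' the control class, so the AUC is computed in the intended direction. A small self-contained illustration (the data are made up):

```r
library(pROC)

obs  <- factor(c("successful", "unsuccessful", "successful", "unsuccessful"),
               levels = c("successful", "unsuccessful"))
prob <- c(0.9, 0.2, 0.8, 0.4)  # predicted probability of "successful"

# rev() makes "unsuccessful" the control level and "successful" the case level
r <- roc(response = obs, predictor = prob,
         levels = rev(levels(obs)))
auc(r)  # every case outscores every control here, so the AUC is 1
```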
> dataGrid <- melt(dataGrid, id.vars = "Day")
>
> byDay <- training[pre2008, c("Day", "Class")]
> byDay$Binned <- cut(byDay$Day, seq(0, 360, by = 5))
>
> observedProps <- ddply(byDay, .(Binned),
+ function(x) c(n = nrow(x), mean = mean(x$Class == "successful")))
> observedProps$midpoint <- seq(2.5, 357.5, by = 5)
>
> xyplot(value ~ Day|variable, data = dataGrid,
+ ylab = "Probability of A Successful Grant",
+ ylim = extendrange(0:1),
+ between = list(x = 1),
+ panel = function(...)
+ {
+ panel.xyplot(x = observedProps$midpoint, observedProps$mean,
+ pch = 16., col = rgb(.2, .2, .2, .5))
+ panel.xyplot(..., type = "l", col = "black", lwd = 2)
+ })
>
> ## For the reduced set of factors, fit the logistic regression model (linear and
> ## quadratic) and evaluate on the 2008 hold-out set.
> training$Day2 <- training$Day^2
> testing$Day2 <- testing$Day^2
> fullSet <- c(fullSet, "Day2")
> reducedSet <- c(reducedSet, "Day2")
>
> ## This control object will be used across multiple models so that the
> ## data splitting is consistent
>
> ctrl <- trainControl(method = "LGOCV",
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE,
+ index = list(TrainSet = pre2008),
+ savePredictions = TRUE)
>
> set.seed(476)
> lrFit <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "glm",
+ metric = "ROC",
+ trControl = ctrl)
> lrFit
Generalized Linear Model
8190 samples
253 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results
ROC Sens Spec
0.872 0.804 0.822
> set.seed(476)
> lrFit2 <- train(x = training[,c(fullSet, "Day2")],
+ y = training$Class,
+ method = "glm",
+ metric = "ROC",
+ trControl = ctrl)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
4: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
> lrFit2
Generalized Linear Model
8190 samples
1072 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results
ROC Sens Spec
0.782 0.77 0.761
>
> lrFit$pred <- merge(lrFit$pred, lrFit$bestTune)
>
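The `merge()` with `$bestTune` is a compact filter: `train()` saves hold-out predictions for every tuning setting, and merging on the tuning columns keeps only the rows that match the winning combination (with `glm` there is just a placeholder parameter, so all rows survive; for the tuned models below the same line drops the non-optimal settings). A toy version of the idiom, with made-up column names:

```r
# Hypothetical saved predictions across two candidate ncomp values
pred     <- data.frame(ncomp = c(1, 1, 2, 2),
                       obs   = c("a", "b", "a", "b"),
                       prob  = c(.60, .40, .70, .30))
bestTune <- data.frame(ncomp = 2)

merge(pred, bestTune)  # only the ncomp == 2 rows remain
```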
> ## Get the confusion matrices for the hold-out set
> lrCM <- confusionMatrix(lrFit, norm = "none")
> lrCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Loading required package: class
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 458 176
unsuccessful 112 811
Accuracy : 0.815
95% CI : (0.7948, 0.834)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6107
Mcnemar's Test P-Value : 0.0002054
Sensitivity : 0.8035
Specificity : 0.8217
Pos Pred Value : 0.7224
Neg Pred Value : 0.8787
Prevalence : 0.3661
Detection Rate : 0.2942
Detection Prevalence : 0.4072
Balanced Accuracy : 0.8126
'Positive' Class : successful
> lrCM2 <- confusionMatrix(lrFit2, norm = "none")
> lrCM2
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 439 236
unsuccessful 131 751
Accuracy : 0.7643
95% CI : (0.7424, 0.7852)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5112
Mcnemar's Test P-Value : 5.675e-08
Sensitivity : 0.7702
Specificity : 0.7609
Pos Pred Value : 0.6504
Neg Pred Value : 0.8515
Prevalence : 0.3661
Detection Rate : 0.2820
Detection Prevalence : 0.4335
Balanced Accuracy : 0.7655
'Positive' Class : successful
>
> ## Get the area under the ROC curve for the hold-out set
> lrRoc <- roc(response = lrFit$pred$obs,
+ predictor = lrFit$pred$successful,
+ levels = rev(levels(lrFit$pred$obs)))
> lrRoc2 <- roc(response = lrFit2$pred$obs,
+ predictor = lrFit2$pred$successful,
+ levels = rev(levels(lrFit2$pred$obs)))
> lrImp <- varImp(lrFit, scale = FALSE)
>
> plot(lrRoc, legacy.axes = TRUE)
Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful, levels = rev(levels(lrFit$pred$obs)))
Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
>
> ################################################################################
> ### Section 12.3 Linear Discriminant Analysis
>
> ## Fit the model to the reduced set
> set.seed(476)
> ldaFit <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "lda",
+ preProc = c("center","scale"),
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: MASS
> ldaFit
Linear Discriminant Analysis
8190 samples
253 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results
ROC Sens Spec
0.889 0.804 0.823
>
> ldaFit$pred <- merge(ldaFit$pred, ldaFit$bestTune)
> ldaCM <- confusionMatrix(ldaFit, norm = "none")
> ldaCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 458 175
unsuccessful 112 812
Accuracy : 0.8157
95% CI : (0.7955, 0.8346)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6119
Mcnemar's Test P-Value : 0.0002525
Sensitivity : 0.8035
Specificity : 0.8227
Pos Pred Value : 0.7235
Neg Pred Value : 0.8788
Prevalence : 0.3661
Detection Rate : 0.2942
Detection Prevalence : 0.4066
Balanced Accuracy : 0.8131
'Positive' Class : successful
> ldaRoc <- roc(response = ldaFit$pred$obs,
+ predictor = ldaFit$pred$successful,
+ levels = rev(levels(ldaFit$pred$obs)))
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful, levels = rev(levels(lrFit$pred$obs)))
Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(ldaRoc, add = TRUE, type = "s", legacy.axes = TRUE)
Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful, levels = rev(levels(ldaFit$pred$obs)))
Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
>
> ################################################################################
> ### Section 12.4 Partial Least Squares Discriminant Analysis
>
> ## This model uses all of the predictors
> set.seed(476)
> plsFit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "pls",
+ tuneGrid = expand.grid(ncomp = 1:10),
+ preProc = c("center","scale"),
+ metric = "ROC",
+ probMethod = "Bayes",
+ trControl = ctrl)
Loading required package: pls
Attaching package: 'pls'
The following object is masked from 'package:caret':
R2
The following object is masked from 'package:stats':
loadings
> plsFit
Partial Least Squares
8190 samples
1071 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
ncomp ROC Sens Spec
1 0.821 0.863 0.667
2 0.847 0.83 0.749
3 0.863 0.851 0.749
4 0.863 0.835 0.754
5 0.864 0.839 0.77
6 0.87 0.837 0.77
7 0.865 0.816 0.776
8 0.862 0.816 0.779
9 0.864 0.825 0.778
10 0.858 0.812 0.782
ROC was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 6.
>
> plsImpGrant <- varImp(plsFit, scale = FALSE)
>
> bestPlsNcomp <- plsFit$results[best(plsFit$results, "ROC", maximize = TRUE), "ncomp"]
> bestPlsROC <- plsFit$results[best(plsFit$results, "ROC", maximize = TRUE), "ROC"]
>
> ## Only keep the final tuning parameter data
> plsFit$pred <- merge(plsFit$pred, plsFit$bestTune)
>
> plsRoc <- roc(response = plsFit$pred$obs,
+ predictor = plsFit$pred$successful,
+ levels = rev(levels(plsFit$pred$obs)))
>
> ### PLS confusion matrix information
> plsCM <- confusionMatrix(plsFit, norm = "none")
> plsCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 477 227
unsuccessful 93 760
Accuracy : 0.7945
95% CI : (0.7735, 0.8143)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5781
Mcnemar's Test P-Value : 1.046e-13
Sensitivity : 0.8368
Specificity : 0.7700
Pos Pred Value : 0.6776
Neg Pred Value : 0.8910
Prevalence : 0.3661
Detection Rate : 0.3064
Detection Prevalence : 0.4522
Balanced Accuracy : 0.8034
'Positive' Class : successful
>
> ## Now fit a model that uses a smaller set of predictors chosen by unsupervised
> ## filtering.
>
> set.seed(476)
> plsFit2 <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "pls",
+ tuneGrid = expand.grid(ncomp = 1:10),
+ preProc = c("center","scale"),
+ metric = "ROC",
+ probMethod = "Bayes",
+ trControl = ctrl)
> plsFit2
Partial Least Squares
8190 samples
253 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
ncomp ROC Sens Spec
1 0.836 0.912 0.616
2 0.868 0.858 0.752
3 0.889 0.874 0.762
4 0.895 0.86 0.777
5 0.895 0.846 0.79
6 0.894 0.832 0.795
7 0.89 0.823 0.806
8 0.888 0.83 0.803
9 0.887 0.83 0.803
10 0.884 0.821 0.807
ROC was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 4.
>
> bestPlsNcomp2 <- plsFit2$results[best(plsFit2$results, "ROC", maximize = TRUE), "ncomp"]
> bestPlsROC2 <- plsFit2$results[best(plsFit2$results, "ROC", maximize = TRUE), "ROC"]
>
> plsFit2$pred <- merge(plsFit2$pred, plsFit2$bestTune)
>
> plsRoc2 <- roc(response = plsFit2$pred$obs,
+ predictor = plsFit2$pred$successful,
+ levels = rev(levels(plsFit2$pred$obs)))
> plsCM2 <- confusionMatrix(plsFit2, norm = "none")
> plsCM2
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 490 220
unsuccessful 80 767
Accuracy : 0.8073
95% CI : (0.7868, 0.8266)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6053
Mcnemar's Test P-Value : 1.014e-15
Sensitivity : 0.8596
Specificity : 0.7771
Pos Pred Value : 0.6901
Neg Pred Value : 0.9055
Prevalence : 0.3661
Detection Rate : 0.3147
Detection Prevalence : 0.4560
Balanced Accuracy : 0.8184
'Positive' Class : successful
>
> pls.ROC <- cbind(plsFit$results,Descriptors="Full Set")
> pls2.ROC <- cbind(plsFit2$results,Descriptors="Reduced Set")
>
> plsCompareROC <- data.frame(rbind(pls.ROC,pls2.ROC))
>
> xyplot(ROC ~ ncomp,
+ data = plsCompareROC,
+ xlab = "# Components",
+ ylab = "ROC (2008 Hold-Out Data)",
+ auto.key = list(columns = 2),
+ groups = Descriptors,
+ type = c("o", "g"))
>
> ## Plot ROC curves and variable importance scores
> plot(ldaRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful, levels = rev(levels(ldaFit$pred$obs)))
Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful, levels = rev(levels(lrFit$pred$obs)))
Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(plsRoc2, type = "s", add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = plsFit2$pred$obs, predictor = plsFit2$pred$successful, levels = rev(levels(plsFit2$pred$obs)))
Data: plsFit2$pred$successful in 987 controls (plsFit2$pred$obs unsuccessful) < 570 cases (plsFit2$pred$obs successful).
Area under the curve: 0.895
>
> plot(plsImpGrant, top=20, scales = list(y = list(cex = .95)))
>
> ################################################################################
> ### Section 12.5 Penalized Models
>
> ## The glmnet model
> glmnGrid <- expand.grid(alpha = c(0, .1, .2, .4, .6, .8, 1),
+ lambda = seq(.01, .2, length = 40))
> set.seed(476)
> glmnFit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "glmnet",
+ tuneGrid = glmnGrid,
+ preProc = c("center", "scale"),
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: glmnet
Loading required package: Matrix
Loaded glmnet 1.9-3
Attaching package: 'glmnet'
The following object is masked from 'package:pROC':
auc
> glmnFit
glmnet
8190 samples
1071 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
alpha lambda ROC Sens Spec
0 0.01 0.856 0.8 0.791
0 0.0149 0.856 0.8 0.791
0 0.0197 0.856 0.8 0.791
0 0.0246 0.858 0.804 0.796
0 0.0295 0.86 0.804 0.801
0 0.0344 0.861 0.802 0.8
0 0.0392 0.862 0.804 0.801
0 0.0441 0.863 0.804 0.801
0 0.049 0.863 0.798 0.8
0 0.0538 0.864 0.807 0.799
0 0.0587 0.866 0.809 0.796
0 0.0636 0.864 0.807 0.797
0 0.0685 0.866 0.809 0.797
0 0.0733 0.865 0.807 0.797
0 0.0782 0.866 0.811 0.794
0 0.0831 0.867 0.814 0.792
0 0.0879 0.866 0.814 0.792
0 0.0928 0.866 0.816 0.792
0 0.0977 0.867 0.816 0.793
0 0.103 0.867 0.819 0.791
0 0.107 0.867 0.818 0.79
0 0.112 0.867 0.819 0.79
0 0.117 0.866 0.819 0.789
0 0.122 0.866 0.816 0.791
0 0.127 0.867 0.821 0.794
0 0.132 0.866 0.819 0.791
0 0.137 0.866 0.823 0.789
0 0.142 0.867 0.823 0.789
0 0.146 0.866 0.818 0.792
0 0.151 0.865 0.818 0.792
0 0.156 0.866 0.821 0.789
0 0.161 0.866 0.819 0.79
0 0.166 0.866 0.825 0.788
0 0.171 0.867 0.825 0.788
0 0.176 0.866 0.821 0.792
0 0.181 0.865 0.823 0.791
0 0.185 0.866 0.826 0.788
0 0.19 0.865 0.825 0.788
0 0.195 0.866 0.821 0.791
0 0.2 0.865 0.823 0.789
0.1 0.01 0.866 0.809 0.797
0.1 0.0149 0.874 0.823 0.798
0.1 0.0197 0.881 0.826 0.797
0.1 0.0246 0.886 0.828 0.803
0.1 0.0295 0.89 0.835 0.803
0.1 0.0344 0.892 0.839 0.81
0.1 0.0392 0.895 0.84 0.811
0.1 0.0441 0.898 0.851 0.809
0.1 0.049 0.9 0.851 0.809
0.1 0.0538 0.9 0.853 0.814
0.1 0.0587 0.902 0.858 0.805
0.1 0.0636 0.903 0.86 0.809
0.1 0.0685 0.904 0.868 0.803
0.1 0.0733 0.906 0.874 0.799
0.1 0.0782 0.906 0.87 0.802
0.1 0.0831 0.906 0.872 0.801
0.1 0.0879 0.907 0.877 0.8
0.1 0.0928 0.908 0.877 0.794
0.1 0.0977 0.908 0.879 0.795
0.1 0.103 0.907 0.877 0.795
0.1 0.107 0.907 0.881 0.792
0.1 0.112 0.907 0.881 0.797
0.1 0.117 0.907 0.884 0.795
0.1 0.122 0.908 0.886 0.791
0.1 0.127 0.907 0.882 0.791
0.1 0.132 0.909 0.884 0.79
0.1 0.137 0.908 0.886 0.789
0.1 0.142 0.908 0.884 0.786
0.1 0.146 0.909 0.886 0.784
0.1 0.151 0.908 0.881 0.787
0.1 0.156 0.908 0.881 0.785
0.1 0.161 0.909 0.884 0.787
0.1 0.166 0.908 0.882 0.787
0.1 0.171 0.91 0.889 0.785
0.1 0.176 0.91 0.889 0.784
0.1 0.181 0.91 0.889 0.785
0.1 0.185 0.909 0.886 0.788
0.1 0.19 0.91 0.893 0.778
0.1 0.195 0.909 0.889 0.784
0.1 0.2 0.909 0.891 0.781
0.2 0.01 0.878 0.83 0.8
0.2 0.0149 0.887 0.835 0.803
0.2 0.0197 0.891 0.839 0.804
0.2 0.0246 0.896 0.849 0.802
0.2 0.0295 0.899 0.853 0.805
0.2 0.0344 0.902 0.858 0.8
0.2 0.0392 0.902 0.86 0.798
0.2 0.0441 0.903 0.874 0.794
0.2 0.049 0.904 0.879 0.802
0.2 0.0538 0.904 0.879 0.794
0.2 0.0587 0.905 0.881 0.797
0.2 0.0636 0.904 0.881 0.8
0.2 0.0685 0.905 0.888 0.793
0.2 0.0733 0.907 0.888 0.793
0.2 0.0782 0.905 0.886 0.792
0.2 0.0831 0.906 0.884 0.793
0.2 0.0879 0.907 0.886 0.788
0.2 0.0928 0.905 0.882 0.789
0.2 0.0977 0.906 0.881 0.791
0.2 0.103 0.906 0.888 0.777
0.2 0.107 0.907 0.889 0.778
0.2 0.112 0.906 0.884 0.774
0.2 0.117 0.905 0.882 0.777
0.2 0.122 0.905 0.881 0.779
0.2 0.127 0.905 0.879 0.778
0.2 0.132 0.905 0.884 0.772
0.2 0.137 0.905 0.884 0.77
0.2 0.142 0.904 0.877 0.779
0.2 0.146 0.904 0.879 0.773
0.2 0.151 0.905 0.884 0.77
0.2 0.156 0.904 0.879 0.778
0.2 0.161 0.905 0.886 0.768
0.2 0.166 0.905 0.898 0.761
0.2 0.171 0.904 0.891 0.766
0.2 0.176 0.904 0.884 0.775
0.2 0.181 0.903 0.875 0.772
0.2 0.185 0.905 0.898 0.759
0.2 0.19 0.904 0.886 0.764
0.2 0.195 0.903 0.879 0.772
0.2 0.2 0.903 0.888 0.765
0.4 0.01 0.887 0.84 0.798
0.4 0.0149 0.893 0.853 0.796
0.4 0.0197 0.896 0.858 0.795
0.4 0.0246 0.897 0.863 0.796
0.4 0.0295 0.897 0.87 0.793
0.4 0.0344 0.897 0.875 0.786
0.4 0.0392 0.897 0.868 0.799
0.4 0.0441 0.898 0.875 0.793
0.4 0.049 0.898 0.874 0.79
0.4 0.0538 0.898 0.874 0.794
0.4 0.0587 0.897 0.874 0.78
0.4 0.0636 0.897 0.875 0.778
0.4 0.0685 0.9 0.881 0.766
0.4 0.0733 0.898 0.879 0.767
0.4 0.0782 0.899 0.882 0.76
0.4 0.0831 0.9 0.879 0.765
0.4 0.0879 0.899 0.877 0.765
0.4 0.0928 0.902 0.888 0.758
0.4 0.0977 0.902 0.888 0.756
0.4 0.103 0.901 0.882 0.765
0.4 0.107 0.902 0.886 0.769
0.4 0.112 0.902 0.886 0.764
0.4 0.117 0.904 0.9 0.757
0.4 0.122 0.904 0.9 0.749
0.4 0.127 0.903 0.902 0.748
0.4 0.132 0.903 0.904 0.743
0.4 0.137 0.901 0.893 0.747
0.4 0.142 0.903 0.9 0.747
0.4 0.146 0.9 0.914 0.719
0.4 0.151 0.902 0.926 0.708
0.4 0.156 0.901 0.944 0.699
0.4 0.161 0.896 0.902 0.716
0.4 0.166 0.897 0.912 0.716
0.4 0.171 0.901 0.935 0.705
0.4 0.176 0.898 0.954 0.688
0.4 0.181 0.894 0.951 0.686
0.4 0.185 0.891 0.919 0.7
0.4 0.19 0.877 0.891 0.693
0.4 0.195 0.877 0.926 0.676
0.4 0.2 0.881 0.94 0.674
0.6 0.01 0.889 0.842 0.803
0.6 0.0149 0.892 0.846 0.8
0.6 0.0197 0.892 0.863 0.793
0.6 0.0246 0.893 0.868 0.787
0.6 0.0295 0.893 0.865 0.785
0.6 0.0344 0.894 0.875 0.78
0.6 0.0392 0.893 0.874 0.777
0.6 0.0441 0.894 0.872 0.779
0.6 0.049 0.893 0.865 0.775
0.6 0.0538 0.897 0.879 0.767
0.6 0.0587 0.897 0.875 0.765
0.6 0.0636 0.9 0.879 0.76
0.6 0.0685 0.899 0.879 0.764
0.6 0.0733 0.901 0.889 0.756
0.6 0.0782 0.902 0.889 0.756
0.6 0.0831 0.901 0.895 0.747
0.6 0.0879 0.903 0.907 0.737
0.6 0.0928 0.9 0.896 0.744
0.6 0.0977 0.899 0.898 0.739
0.6 0.103 0.902 0.918 0.721
0.6 0.107 0.901 0.944 0.7
0.6 0.112 0.902 0.93 0.708
0.6 0.117 0.89 0.909 0.702
0.6 0.122 0.894 0.939 0.693
0.6 0.127 0.888 0.968 0.669
0.6 0.132 0.876 0.9 0.688
0.6 0.137 0.876 0.916 0.679
0.6 0.142 0.869 0.981 0.652
0.6 0.146 0.87 0.977 0.656
0.6 0.151 0.868 0.991 0.643
0.6 0.156 0.868 0.984 0.648
0.6 0.161 0.869 0.988 0.644
0.6 0.166 0.872 0.991 0.638
0.6 0.171 0.869 0.991 0.638
0.6 0.176 0.869 0.991 0.638
0.6 0.181 0.872 0.991 0.638
0.6 0.185 0.823 0.991 0.638
0.6 0.19 0.823 0.991 0.638
0.6 0.195 0.823 0.991 0.638
0.6 0.2 0.823 0.991 0.638
0.8 0.01 0.89 0.842 0.801
0.8 0.0149 0.89 0.858 0.795
0.8 0.0197 0.888 0.861 0.784
0.8 0.0246 0.89 0.867 0.777
0.8 0.0295 0.89 0.87 0.773
0.8 0.0344 0.892 0.867 0.775
0.8 0.0392 0.892 0.87 0.766
0.8 0.0441 0.895 0.86 0.768
0.8 0.049 0.896 0.868 0.767
0.8 0.0538 0.898 0.884 0.76
0.8 0.0587 0.899 0.882 0.76
0.8 0.0636 0.898 0.872 0.759
0.8 0.0685 0.901 0.904 0.74
0.8 0.0733 0.902 0.918 0.723
0.8 0.0782 0.897 0.898 0.72
0.8 0.0831 0.901 0.937 0.706
0.8 0.0879 0.897 0.953 0.683
0.8 0.0928 0.892 0.914 0.702
0.8 0.0977 0.877 0.904 0.688
0.8 0.103 0.881 0.954 0.671
0.8 0.107 0.868 0.981 0.652
0.8 0.112 0.868 0.974 0.656
0.8 0.117 0.868 0.986 0.646
0.8 0.122 0.868 0.982 0.648
0.8 0.127 0.872 0.991 0.638
0.8 0.132 0.868 0.991 0.638
0.8 0.137 0.823 0.991 0.638
0.8 0.142 0.823 0.991 0.638
0.8 0.146 0.823 0.991 0.638
0.8 0.151 0.823 0.991 0.638
0.8 0.156 0.823 0.991 0.638
0.8 0.161 0.823 0.991 0.638
0.8 0.166 0.815 0.991 0.638
0.8 0.171 0.815 0.991 0.638
0.8 0.176 0.815 0.991 0.638
0.8 0.181 0.815 0.991 0.638
0.8 0.185 0.815 0.991 0.638
0.8 0.19 0.815 0.991 0.638
0.8 0.195 0.815 0.991 0.638
0.8 0.2 0.815 0.991 0.638
1 0.01 0.889 0.854 0.799
1 0.0149 0.888 0.858 0.791
1 0.0197 0.887 0.858 0.774
1 0.0246 0.887 0.854 0.775
1 0.0295 0.888 0.851 0.775
1 0.0344 0.891 0.861 0.772
1 0.0392 0.897 0.881 0.761
1 0.0441 0.897 0.882 0.761
1 0.049 0.898 0.889 0.751
1 0.0538 0.9 0.891 0.752
1 0.0587 0.899 0.904 0.737
1 0.0636 0.901 0.926 0.708
1 0.0685 0.897 0.949 0.688
1 0.0733 0.898 0.94 0.693
1 0.0782 0.891 0.965 0.671
1 0.0831 0.868 0.949 0.666
1 0.0879 0.868 0.944 0.667
1 0.0928 0.867 0.991 0.643
1 0.0977 0.867 0.991 0.643
1 0.103 0.872 0.991 0.638
1 0.107 0.867 0.991 0.638
1 0.112 0.823 0.991 0.638
1 0.117 0.823 0.991 0.638
1 0.122 0.823 0.991 0.638
1 0.127 0.823 0.991 0.638
1 0.132 0.815 0.991 0.638
1 0.137 0.815 0.991 0.638
1 0.142 0.815 0.991 0.638
1 0.146 0.815 0.991 0.638
1 0.151 0.815 0.991 0.638
1 0.156 0.815 0.991 0.638
1 0.161 0.815 0.991 0.638
1 0.166 0.815 0.991 0.638
1 0.171 0.815 0.991 0.638
1 0.176 0.815 0.991 0.638
1 0.181 0.815 0.991 0.638
1 0.185 0.815 0.991 0.638
1 0.19 0.815 0.991 0.638
1 0.195 0.815 0 1
1 0.2 0.815 0 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.1 and lambda = 0.176.
>
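In the grid above, `alpha` mixes the two penalty types (0 = pure ridge, 1 = pure lasso) and `lambda` controls how much total shrinkage is applied. A direct `glmnet()` call on simulated data shows the same parameters at work outside of caret (the data here are purely illustrative):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), ncol = 20)
y <- factor(ifelse(x[, 1] + rnorm(100) > 0, "successful", "unsuccessful"))

# Elastic net at alpha = 0.1: mostly ridge, with a little lasso sparsity
fit <- glmnet(x, y, family = "binomial", alpha = 0.1)

# As lambda grows, more coefficients are shrunk exactly to zero
head(cbind(lambda = fit$lambda, nonzero = fit$df))
```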
> glmnet2008 <- merge(glmnFit$pred, glmnFit$bestTune)
> glmnetCM <- confusionMatrix(glmnFit, norm = "none")
> glmnetCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 507 213
unsuccessful 63 774
Accuracy : 0.8227
95% CI : (0.8028, 0.8414)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6382
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8895
Specificity : 0.7842
Pos Pred Value : 0.7042
Neg Pred Value : 0.9247
Prevalence : 0.3661
Detection Rate : 0.3256
Detection Prevalence : 0.4624
Balanced Accuracy : 0.8368
'Positive' Class : successful
>
> glmnetRoc <- roc(response = glmnet2008$obs,
+ predictor = glmnet2008$successful,
+ levels = rev(levels(glmnet2008$obs)))
>
> glmnFit0 <- glmnFit
> glmnFit0$results$lambda <- format(round(glmnFit0$results$lambda, 3))
>
> glmnPlot <- plot(glmnFit0,
+ plotType = "level",
+ cuts = 15,
+ scales = list(x = list(rot = 90, cex = .65)))
>
> update(glmnPlot,
+ ylab = "Mixing Percentage\nRidge <---------> Lasso",
+ sub = "",
+ main = "Area Under the ROC Curve",
+ xlab = "Amount of Regularization")
>
> plot(plsRoc2, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = plsFit2$pred$obs, predictor = plsFit2$pred$successful, levels = rev(levels(plsFit2$pred$obs)))
Data: plsFit2$pred$successful in 987 controls (plsFit2$pred$obs unsuccessful) < 570 cases (plsFit2$pred$obs successful).
Area under the curve: 0.895
> plot(ldaRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful, levels = rev(levels(ldaFit$pred$obs)))
Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful, levels = rev(levels(lrFit$pred$obs)))
Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(glmnetRoc, type = "s", add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = glmnet2008$obs, predictor = glmnet2008$successful, levels = rev(levels(glmnet2008$obs)))
Data: glmnet2008$successful in 987 controls (glmnet2008$obs unsuccessful) < 570 cases (glmnet2008$obs successful).
Area under the curve: 0.91
>
> ## Sparse logistic regression
>
> set.seed(476)
> spLDAFit <- train(x = training[,fullSet],
+ y = training$Class,
+ "sparseLDA",
+ tuneGrid = expand.grid(lambda = c(.1),
+ NumVars = c(1:20, 50, 75, 100, 250, 500, 750, 1000)),
+ preProc = c("center", "scale"),
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: sparseLDA
Loading required package: lars
Loaded lars 1.2
Loading required package: elasticnet
Loading required package: mda
> spLDAFit
Sparse Linear Discriminant Analysis
8190 samples
1071 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
NumVars ROC Sens Spec
1 0.815 0.991 0.638
2 0.823 0.991 0.638
3 0.865 0.991 0.638
4 0.868 0.96 0.663
5 0.886 0.961 0.67
6 0.901 0.921 0.719
7 0.899 0.891 0.751
8 0.898 0.888 0.754
9 0.897 0.886 0.751
10 0.897 0.886 0.751
11 0.897 0.886 0.751
12 0.897 0.886 0.754
13 0.897 0.886 0.755
14 0.897 0.886 0.755
15 0.897 0.886 0.755
16 0.897 0.886 0.756
17 0.897 0.884 0.764
18 0.897 0.884 0.765
19 0.897 0.882 0.766
20 0.897 0.882 0.765
50 0.899 0.877 0.78
75 0.9 0.877 0.785
100 0.901 0.875 0.787
250 0.9 0.856 0.797
500 0.89 0.837 0.8
750 0.878 0.818 0.799
1000 0.864 0.802 0.798
Tuning parameter 'lambda' was held constant at a value of 0.1
ROC was used to select the optimal model using the largest value.
The final values used for the model were NumVars = 6 and lambda = 0.1.
>
> spLDA2008 <- merge(spLDAFit$pred, spLDAFit$bestTune)
> spLDACM <- confusionMatrix(spLDAFit, norm = "none")
> spLDACM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 525 277
unsuccessful 45 710
Accuracy : 0.7932
95% CI : (0.7722, 0.8131)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5897
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9211
Specificity : 0.7194
Pos Pred Value : 0.6546
Neg Pred Value : 0.9404
Prevalence : 0.3661
Detection Rate : 0.3372
Detection Prevalence : 0.5151
Balanced Accuracy : 0.8202
'Positive' Class : successful
>
> spLDARoc <- roc(response = spLDA2008$obs,
+ predictor = spLDA2008$successful,
+ levels = rev(levels(spLDA2008$obs)))
>
> update(plot(spLDAFit, scales = list(x = list(log = 10))),
+ ylab = "ROC AUC (2008 Hold-Out Data)")
>
> plot(plsRoc2, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = plsFit2$pred$obs, predictor = plsFit2$pred$successful, levels = rev(levels(plsFit2$pred$obs)))
Data: plsFit2$pred$successful in 987 controls (plsFit2$pred$obs unsuccessful) < 570 cases (plsFit2$pred$obs successful).
Area under the curve: 0.895
> plot(glmnetRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = glmnet2008$obs, predictor = glmnet2008$successful, levels = rev(levels(glmnet2008$obs)))
Data: glmnet2008$successful in 987 controls (glmnet2008$obs unsuccessful) < 570 cases (glmnet2008$obs successful).
Area under the curve: 0.91
> plot(ldaRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful, levels = rev(levels(ldaFit$pred$obs)))
Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful, levels = rev(levels(lrFit$pred$obs)))
Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(spLDARoc, type = "s", add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = spLDA2008$obs, predictor = spLDA2008$successful, levels = rev(levels(spLDA2008$obs)))
Data: spLDA2008$successful in 987 controls (spLDA2008$obs unsuccessful) < 570 cases (spLDA2008$obs successful).
Area under the curve: 0.9015
>
> ################################################################################
> ### Section 12.6 Nearest Shrunken Centroids
>
> set.seed(476)
> nscFit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "pam",
+ preProc = c("center", "scale"),
+ tuneGrid = data.frame(threshold = seq(0, 25, length = 30)),
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: pamr
Loading required package: cluster
Loading required package: survival
Loading required package: splines
Attaching package: 'survival'
The following object is masked from 'package:caret':
cluster
> nscFit
Nearest Shrunken Centroids
8190 samples
1071 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
threshold ROC Sens Spec
0 0.827 0.784 0.733
0.862 0.864 0.842 0.73
1.72 0.871 0.865 0.736
2.59 0.873 0.861 0.744
3.45 0.873 0.849 0.752
4.31 0.868 0.823 0.754
5.17 0.866 0.821 0.753
6.03 0.862 0.856 0.732
6.9 0.852 0.844 0.721
7.76 0.857 0.935 0.675
8.62 0.872 0.991 0.638
9.48 0.832 0.991 0.638
10.3 0.823 0.991 0.638
11.2 0.815 0.991 0.638
12.1 0.815 0.991 0.638
12.9 0.815 0.991 0.638
13.8 0.815 0 1
14.7 0.815 0 1
15.5 0.815 0 1
16.4 0.815 0 1
17.2 0.815 0 1
18.1 0.5 0 1
19 0.5 0 1
19.8 0.5 0 1
20.7 0.5 0 1
21.6 0.5 0 1
22.4 0.5 0 1
23.3 0.5 0 1
24.1 0.5 0 1
25 0.5 0 1
ROC was used to select the optimal model using the largest value.
The final value used for the model was threshold = 2.59.
>
> nsc2008 <- merge(nscFit$pred, nscFit$bestTune)
> nscCM <- confusionMatrix(nscFit, norm = "none")
> nscCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 491 253
unsuccessful 79 734
Accuracy : 0.7868
95% CI : (0.7656, 0.8069)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5684
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8614
Specificity : 0.7437
Pos Pred Value : 0.6599
Neg Pred Value : 0.9028
Prevalence : 0.3661
Detection Rate : 0.3154
Detection Prevalence : 0.4778
Balanced Accuracy : 0.8025
'Positive' Class : successful
> nscRoc <- roc(response = nsc2008$obs,
+ predictor = nsc2008$successful,
+ levels = rev(levels(nsc2008$obs)))
> update(plot(nscFit), ylab = "ROC AUC (2008 Hold-Out Data)")
>
>
> plot(plsRoc2, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = plsFit2$pred$obs, predictor = plsFit2$pred$successful, levels = rev(levels(plsFit2$pred$obs)))
Data: plsFit2$pred$successful in 987 controls (plsFit2$pred$obs unsuccessful) < 570 cases (plsFit2$pred$obs successful).
Area under the curve: 0.895
> plot(glmnetRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = glmnet2008$obs, predictor = glmnet2008$successful, levels = rev(levels(glmnet2008$obs)))
Data: glmnet2008$successful in 987 controls (glmnet2008$obs unsuccessful) < 570 cases (glmnet2008$obs successful).
Area under the curve: 0.91
> plot(ldaRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = ldaFit$pred$obs, predictor = ldaFit$pred$successful, levels = rev(levels(ldaFit$pred$obs)))
Data: ldaFit$pred$successful in 987 controls (ldaFit$pred$obs unsuccessful) < 570 cases (ldaFit$pred$obs successful).
Area under the curve: 0.8892
> plot(lrRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = lrFit$pred$obs, predictor = lrFit$pred$successful, levels = rev(levels(lrFit$pred$obs)))
Data: lrFit$pred$successful in 987 controls (lrFit$pred$obs unsuccessful) < 570 cases (lrFit$pred$obs successful).
Area under the curve: 0.8715
> plot(spLDARoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = spLDA2008$obs, predictor = spLDA2008$successful, levels = rev(levels(spLDA2008$obs)))
Data: spLDA2008$successful in 987 controls (spLDA2008$obs unsuccessful) < 570 cases (spLDA2008$obs successful).
Area under the curve: 0.9015
> plot(nscRoc, type = "s", add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = nsc2008$obs, predictor = nsc2008$successful, levels = rev(levels(nsc2008$obs)))
Data: nsc2008$successful in 987 controls (nsc2008$obs unsuccessful) < 570 cases (nsc2008$obs successful).
Area under the curve: 0.8733
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] C
attached base packages:
[1] splines parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] pamr_1.54 survival_2.37-4 cluster_1.14.4 sparseLDA_0.1-6
[5] mda_0.4-2 elasticnet_1.1 lars_1.2 glmnet_1.9-3
[9] Matrix_1.0-12 klaR_0.6-8 pls_2.3-0 MASS_7.3-26
[13] e1071_1.6-1 class_7.3-7 pROC_1.5.4 kernlab_0.9-18
[17] reshape2_1.2.2 plyr_1.8 doMC_1.3.0 iterators_1.0.6
[21] foreach_1.4.0 caret_6.0-22 ggplot2_0.9.3.1 lattice_0.20-15
loaded via a namespace (and not attached):
[1] RColorBrewer_1.0-5 car_2.0-17 codetools_0.2-8 colorspace_1.2-2
[5] compiler_3.0.1 dichromat_2.0-0 digest_0.6.3 grid_3.0.1
[9] gtable_0.1.2 labeling_0.1 munsell_0.4 proto_0.3-10
[13] scales_0.2.3 stringr_0.6.2
>
> q("no")
> proc.time()
user system elapsed
376332.996 8337.928 35694.682
%%R -w 600 -h 600
## runChapterScript(12)
## user system elapsed
## 376332.996 8337.928 35694.682
NULL
%%R
showChapterScript(13)
NULL
%%R
showChapterOutput(13)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 13 Non-Linear Classification Models
> ###
> ### Required packages: AppliedPredictiveModeling, caret, doMC (optional)
> ### kernlab, klaR, lattice, latticeExtra, MASS, mda, nnet,
> ### pROC
> ###
> ### Data used: The grant application data. See the file 'CreateGrantData.R'
> ###
> ### Notes:
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be
> ### syntax differences that occur over time as packages evolve. These files
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may vary slightly.
> ###
> ################################################################################
>
> ################################################################################
> ### Section 13.1 Nonlinear Discriminant Analysis
>
>
> load("grantData.RData")
>
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
>
> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI, etc. We used doMC (not on Windows) to speed
> ### up the computations.
>
> ### WARNING: Be aware of how much memory is needed to parallel
> ### process. It can very quickly overwhelm the available hardware. We
> ### estimate the memory usage (VSIZE = total memory size) to be
> ### 2700M/core.
>
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(12)
>
> ## This control object will be used across multiple models so that the
> ## data splitting is consistent
>
> ctrl <- trainControl(method = "LGOCV",
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE,
+ index = list(TrainSet = pre2008),
+ savePredictions = TRUE)
>
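The `index` argument is what makes this a single, fixed split rather than random leave-group-out resampling: `train()` fits on the rows in `pre2008` and evaluates on everything else (the 2008 grants), identically for every model trained with `ctrl`. A minimal sketch of the same idea on toy row indices (the index values here are illustrative, not from the grant data):

```r
library(caret)

## One explicit split: rows 1:80 train the model, rows 81:100 are held out.
## Passing a single-element 'index' list makes "LGOCV" perform one fixed
## train/test split instead of repeated random ones.
toyCtrl <- trainControl(method = "LGOCV",
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        index = list(TrainSet = 1:80),
                        savePredictions = TRUE)
```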
> set.seed(476)
> mdaFit <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "mda",
+ metric = "ROC",
+ tries = 40,
+ tuneGrid = expand.grid(subclasses = 1:8),
+ trControl = ctrl)
Loading required package: mda
Loading required package: class
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following objects are masked from ‘package:stats’:
cov, smooth, var
> mdaFit
Mixture Discriminant Analysis
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
subclasses ROC Sens Spec
1 0.887 0.811 0.822
2 0.865 0.789 0.813
3 0.831 0.835 0.726
4 0.852 0.732 0.82
5 0.842 0.733 0.797
6 0.822 0.733 0.782
7 0.836 0.823 0.734
8 0.791 0.649 0.851
ROC was used to select the optimal model using the largest value.
The final value used for the model was subclasses = 1.
>
> mdaFit$results <- mdaFit$results[!is.na(mdaFit$results$ROC),]
> mdaFit$pred <- merge(mdaFit$pred, mdaFit$bestTune)
> mdaCM <- confusionMatrix(mdaFit, norm = "none")
> mdaCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 462 176
unsuccessful 108 811
Accuracy : 0.8176
95% CI : (0.7975, 0.8365)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6167
Mcnemar's Test P-Value : 7.017e-05
Sensitivity : 0.8105
Specificity : 0.8217
Pos Pred Value : 0.7241
Neg Pred Value : 0.8825
Prevalence : 0.3661
Detection Rate : 0.2967
Detection Prevalence : 0.4098
Balanced Accuracy : 0.8161
'Positive' Class : successful
>
> mdaRoc <- roc(response = mdaFit$pred$obs,
+ predictor = mdaFit$pred$successful,
+ levels = rev(levels(mdaFit$pred$obs)))
> mdaRoc
Call:
roc.default(response = mdaFit$pred$obs, predictor = mdaFit$pred$successful, levels = rev(levels(mdaFit$pred$obs)))
Data: mdaFit$pred$successful in 987 controls (mdaFit$pred$obs unsuccessful) < 570 cases (mdaFit$pred$obs successful).
Area under the curve: 0.8874
>
> update(plot(mdaFit,
+ ylab = "ROC AUC (2008 Hold-Out Data)"))
>
> ################################################################################
> ### Section 13.2 Neural Networks
>
> nnetGrid <- expand.grid(size = 1:10, decay = c(0, .1, 1, 2))
> maxSize <- max(nnetGrid$size)
>
>
> ## Four different models are evaluated, based on the data pre-processing and
> ## whether a single model or multiple models are used
>
> set.seed(476)
> nnetFit <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "nnet",
+ metric = "ROC",
+ preProc = c("center", "scale"),
+ tuneGrid = nnetGrid,
+ trace = FALSE,
+ maxit = 2000,
+ MaxNWts = 1*(maxSize * (length(reducedSet) + 1) + maxSize + 1),
+ trControl = ctrl)
Loading required package: nnet
> nnetFit
Neural Network
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
size decay ROC Sens Spec
1 0 0.778 0.765 0.791
1 0.1 0.845 0.793 0.794
1 1 0.844 0.795 0.79
1 2 0.853 0.782 0.811
2 0 0.804 0.811 0.753
2 0.1 0.846 0.807 0.806
2 1 0.86 0.73 0.841
2 2 0.864 0.758 0.834
3 0 0.841 0.805 0.757
3 0.1 0.822 0.786 0.728
3 1 0.857 0.73 0.833
3 2 0.859 0.747 0.81
4 0 0.828 0.795 0.754
4 0.1 0.854 0.74 0.814
4 1 0.869 0.802 0.796
4 2 0.864 0.779 0.785
5 0 0.819 0.767 0.719
5 0.1 0.843 0.786 0.787
5 1 0.845 0.716 0.817
5 2 0.851 0.728 0.829
6 0 0.844 0.728 0.806
6 0.1 0.8 0.693 0.775
6 1 0.848 0.782 0.778
6 2 0.869 0.777 0.82
7 0 0.833 0.807 0.757
7 0.1 0.806 0.728 0.768
7 1 0.831 0.746 0.777
7 2 0.863 0.758 0.822
8 0 0.833 0.761 0.784
8 0.1 0.847 0.751 0.78
8 1 0.857 0.753 0.803
8 2 0.866 0.77 0.814
9 0 0.848 0.784 0.789
9 0.1 0.836 0.719 0.798
9 1 0.843 0.753 0.781
9 2 0.854 0.746 0.803
10 0 0.806 0.707 0.779
10 0.1 0.82 0.726 0.76
10 1 0.846 0.73 0.807
10 2 0.863 0.749 0.817
ROC was used to select the optimal model using the largest value.
The final values used for the model were size = 4 and decay = 1.
>
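The `MaxNWts` bound passed to `train()` above is simply the weight count of a single-hidden-layer network sized for the largest grid value: each of `H` hidden units has `p` input weights plus a bias, and the output unit has `H` weights plus a bias. A quick check with the values used here (`p = 252` predictors, `H = 10` hidden units):

```r
## Number of weights in a single-hidden-layer nnet:
##   hidden layer: H * (p + 1)   (p inputs + bias per hidden unit)
##   output unit:  H + 1         (H inputs + bias)
p <- 252   # length(reducedSet) in this chapter
H <- 10    # maxSize, the largest 'size' in nnetGrid
H * (p + 1) + H + 1   # 2541, the MaxNWts ceiling
```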
> set.seed(476)
> nnetFit2 <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "nnet",
+ metric = "ROC",
+ preProc = c("center", "scale", "spatialSign"),
+ tuneGrid = nnetGrid,
+ trace = FALSE,
+ maxit = 2000,
+ MaxNWts = 1*(maxSize * (length(reducedSet) + 1) + maxSize + 1),
+ trControl = ctrl)
> nnetFit2
Neural Network
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled, spatial sign transformation
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
size decay ROC Sens Spec
1 0 0.782 0.782 0.78
1 0.1 0.863 0.784 0.809
1 1 0.874 0.807 0.805
1 2 0.88 0.804 0.807
2 0 0.776 0.804 0.711
2 0.1 0.892 0.767 0.861
2 1 0.897 0.804 0.839
2 2 0.881 0.805 0.811
3 0 0.841 0.653 0.876
3 0.1 0.887 0.737 0.851
3 1 0.898 0.805 0.851
3 2 0.884 0.805 0.812
4 0 0.786 0.756 0.715
4 0.1 0.871 0.716 0.829
4 1 0.899 0.793 0.84
4 2 0.883 0.804 0.812
5 0 0.862 0.867 0.705
5 0.1 0.858 0.718 0.836
5 1 0.902 0.788 0.857
5 2 0.883 0.804 0.812
6 0 0.808 0.691 0.796
6 0.1 0.859 0.712 0.844
6 1 0.896 0.795 0.842
6 2 0.883 0.804 0.812
7 0 0.807 0.732 0.782
7 0.1 0.843 0.693 0.829
7 1 0.902 0.789 0.857
7 2 0.883 0.804 0.813
8 0 0.73 0.661 0.795
8 0.1 0.858 0.681 0.834
8 1 0.903 0.791 0.853
8 2 0.883 0.804 0.813
9 0 0.857 0.779 0.804
9 0.1 0.87 0.739 0.833
9 1 0.902 0.788 0.857
9 2 0.883 0.804 0.813
10 0 0.788 0.684 0.823
10 0.1 0.876 0.721 0.845
10 1 0.897 0.796 0.842
10 2 0.883 0.804 0.813
ROC was used to select the optimal model using the largest value.
The final values used for the model were size = 8 and decay = 1.
>
> nnetGrid$bag <- FALSE
>
> set.seed(476)
> nnetFit3 <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "avNNet",
+ metric = "ROC",
+ preProc = c("center", "scale"),
+ tuneGrid = nnetGrid,
+ repeats = 10,
+ trace = FALSE,
+ maxit = 2000,
+ MaxNWts = 10*(maxSize * (length(reducedSet) + 1) + maxSize + 1),
+ allowParallel = FALSE, ## otherwise too many workers would be launched.
+ trControl = ctrl)
> nnetFit3
Model Averaged Neural Network
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
size decay ROC Sens Spec
1 0 0.884 0.867 0.762
1 0.1 0.868 0.779 0.812
1 1 0.847 0.774 0.812
1 2 0.849 0.777 0.81
2 0 0.892 0.825 0.791
2 0.1 0.886 0.784 0.854
2 1 0.895 0.788 0.844
2 2 0.895 0.796 0.845
3 0 0.887 0.826 0.793
3 0.1 0.882 0.8 0.825
3 1 0.89 0.795 0.842
3 2 0.899 0.798 0.838
4 0 0.883 0.821 0.805
4 0.1 0.887 0.8 0.821
4 1 0.899 0.781 0.853
4 2 0.902 0.798 0.86
5 0 0.886 0.83 0.79
5 0.1 0.874 0.788 0.824
5 1 0.901 0.8 0.844
5 2 0.9 0.8 0.851
6 0 0.885 0.819 0.807
6 0.1 0.882 0.789 0.827
6 1 0.893 0.786 0.854
6 2 0.9 0.8 0.849
7 0 0.881 0.832 0.761
7 0.1 0.883 0.791 0.821
7 1 0.898 0.811 0.834
7 2 0.899 0.807 0.859
8 0 0.889 0.818 0.793
8 0.1 0.88 0.786 0.83
8 1 0.891 0.8 0.823
8 2 0.901 0.791 0.845
9 0 0.887 0.8 0.806
9 0.1 0.889 0.786 0.817
9 1 0.894 0.791 0.848
9 2 0.9 0.802 0.836
10 0 0.883 0.811 0.805
10 0.1 0.881 0.784 0.825
10 1 0.898 0.793 0.844
10 2 0.896 0.802 0.839
Tuning parameter 'bag' was held constant at a value of FALSE
ROC was used to select the optimal model using the largest value.
The final values used for the model were size = 4, decay = 2 and bag = FALSE.
>
> set.seed(476)
> nnetFit4 <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "avNNet",
+ metric = "ROC",
+ preProc = c("center", "scale", "spatialSign"),
+ tuneGrid = nnetGrid,
+ trace = FALSE,
+ maxit = 2000,
+ repeats = 10,
+ MaxNWts = 10*(maxSize * (length(reducedSet) + 1) + maxSize + 1),
+ allowParallel = FALSE,
+ trControl = ctrl)
> nnetFit4
Model Averaged Neural Network
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled, spatial sign transformation
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
size decay ROC Sens Spec
1 0 0.867 0.784 0.8
1 0.1 0.857 0.782 0.81
1 1 0.874 0.804 0.807
1 2 0.882 0.795 0.802
2 0 0.882 0.782 0.845
2 0.1 0.897 0.754 0.872
2 1 0.875 0.798 0.811
2 2 0.881 0.795 0.804
3 0 0.89 0.796 0.833
3 0.1 0.907 0.788 0.864
3 1 0.876 0.795 0.81
3 2 0.881 0.795 0.804
4 0 0.889 0.795 0.838
4 0.1 0.911 0.782 0.867
4 1 0.874 0.798 0.809
4 2 0.881 0.795 0.805
5 0 0.893 0.786 0.861
5 0.1 0.909 0.779 0.87
5 1 0.875 0.796 0.809
5 2 0.881 0.795 0.805
6 0 0.893 0.786 0.848
6 0.1 0.904 0.754 0.865
6 1 0.876 0.793 0.81
6 2 0.881 0.795 0.805
7 0 0.89 0.782 0.849
7 0.1 0.905 0.76 0.866
7 1 0.881 0.796 0.817
7 2 0.881 0.795 0.805
8 0 0.898 0.795 0.856
8 0.1 0.904 0.756 0.865
8 1 0.878 0.795 0.813
8 2 0.881 0.795 0.805
9 0 0.893 0.782 0.857
9 0.1 0.902 0.761 0.869
9 1 0.878 0.795 0.813
9 2 0.881 0.795 0.805
10 0 0.895 0.786 0.865
10 0.1 0.901 0.76 0.858
10 1 0.878 0.795 0.814
10 2 0.881 0.795 0.806
Tuning parameter 'bag' was held constant at a value of FALSE
ROC was used to select the optimal model using the largest value.
The final values used for the model were size = 4, decay = 0.1 and bag = FALSE.
>
> nnetFit4$pred <- merge(nnetFit4$pred, nnetFit4$bestTune)
> nnetCM <- confusionMatrix(nnetFit4, norm = "none")
> nnetCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 446 131
unsuccessful 124 856
Accuracy : 0.8362
95% CI : (0.8169, 0.8543)
No Information Rate : 0.6339
P-Value [Acc > NIR] : <2e-16
Kappa : 0.648
Mcnemar's Test P-Value : 0.7071
Sensitivity : 0.7825
Specificity : 0.8673
Pos Pred Value : 0.7730
Neg Pred Value : 0.8735
Prevalence : 0.3661
Detection Rate : 0.2864
Detection Prevalence : 0.3706
Balanced Accuracy : 0.8249
'Positive' Class : successful
>
> nnetRoc <- roc(response = nnetFit4$pred$obs,
+ predictor = nnetFit4$pred$successful,
+ levels = rev(levels(nnetFit4$pred$obs)))
>
>
> nnet1 <- nnetFit$results
> nnet1$Transform <- "No Transformation"
> nnet1$Model <- "Single Model"
>
> nnet2 <- nnetFit2$results
> nnet2$Transform <- "Spatial Sign"
> nnet2$Model <- "Single Model"
>
> nnet3 <- nnetFit3$results
> nnet3$Transform <- "No Transformation"
> nnet3$Model <- "Model Averaging"
> nnet3$bag <- NULL
>
> nnet4 <- nnetFit4$results
> nnet4$Transform <- "Spatial Sign"
> nnet4$Model <- "Model Averaging"
> nnet4$bag <- NULL
>
> nnetResults <- rbind(nnet1, nnet2, nnet3, nnet4)
> nnetResults$Model <- factor(as.character(nnetResults$Model),
+ levels = c("Single Model", "Model Averaging"))
> library(latticeExtra)
Loading required package: RColorBrewer
Attaching package: ‘latticeExtra’
The following object is masked from ‘package:ggplot2’:
layer
> useOuterStrips(
+ xyplot(ROC ~ size|Model*Transform,
+ data = nnetResults,
+ groups = decay,
+ as.table = TRUE,
+ type = c("p", "l", "g"),
+ lty = 1,
+ ylab = "ROC AUC (2008 Hold-Out Data)",
+ xlab = "Number of Hidden Units",
+ auto.key = list(columns = 4,
+ title = "Weight Decay",
+ cex.title = 1)))
>
> plot(nnetRoc, type = "s", legacy.axes = TRUE)
Call:
roc.default(response = nnetFit4$pred$obs, predictor = nnetFit4$pred$successful, levels = rev(levels(nnetFit4$pred$obs)))
Data: nnetFit4$pred$successful in 987 controls (nnetFit4$pred$obs unsuccessful) < 570 cases (nnetFit4$pred$obs successful).
Area under the curve: 0.9111
>
> ################################################################################
> ### Section 13.3 Flexible Discriminant Analysis
>
> set.seed(476)
> fdaFit <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "fda",
+ metric = "ROC",
+ tuneGrid = expand.grid(degree = 1, nprune = 2:25),
+ trControl = ctrl)
Loading required package: earth
Loading required package: leaps
Loading required package: plotmo
Loading required package: plotrix
> fdaFit
Flexible Discriminant Analysis
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
nprune ROC Sens Spec
2 0.815 0.991 0.638
3 0.809 0.995 0.567
4 0.86 0.947 0.73
5 0.869 0.963 0.728
6 0.877 0.968 0.727
7 0.893 0.823 0.806
8 0.903 0.779 0.851
9 0.909 0.83 0.841
10 0.915 0.816 0.853
11 0.919 0.825 0.859
12 0.92 0.816 0.865
13 0.918 0.809 0.865
14 0.92 0.807 0.865
15 0.92 0.819 0.861
16 0.921 0.826 0.858
17 0.921 0.818 0.863
18 0.922 0.821 0.86
19 0.924 0.825 0.864
20 0.922 0.825 0.858
21 0.919 0.816 0.869
22 0.919 0.811 0.872
23 0.918 0.811 0.867
24 0.918 0.809 0.868
25 0.918 0.809 0.865
Tuning parameter 'degree' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were degree = 1 and nprune = 19.
>
> fdaFit$pred <- merge(fdaFit$pred, fdaFit$bestTune)
> fdaCM <- confusionMatrix(fdaFit, norm = "none")
> fdaCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 470 134
unsuccessful 100 853
Accuracy : 0.8497
95% CI : (0.831, 0.8671)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6802
Mcnemar's Test P-Value : 0.03098
Sensitivity : 0.8246
Specificity : 0.8642
Pos Pred Value : 0.7781
Neg Pred Value : 0.8951
Prevalence : 0.3661
Detection Rate : 0.3019
Detection Prevalence : 0.3879
Balanced Accuracy : 0.8444
'Positive' Class : successful
>
> fdaRoc <- roc(response = fdaFit$pred$obs,
+ predictor = fdaFit$pred$successful,
+ levels = rev(levels(fdaFit$pred$obs)))
>
> update(plot(fdaFit), ylab = "ROC AUC (2008 Hold-Out Data)")
>
> plot(nnetRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = nnetFit4$pred$obs, predictor = nnetFit4$pred$successful, levels = rev(levels(nnetFit4$pred$obs)))
Data: nnetFit4$pred$successful in 987 controls (nnetFit4$pred$obs unsuccessful) < 570 cases (nnetFit4$pred$obs successful).
Area under the curve: 0.9111
> plot(fdaRoc, type = "s", add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = fdaFit$pred$obs, predictor = fdaFit$pred$successful, levels = rev(levels(fdaFit$pred$obs)))
Data: fdaFit$pred$successful in 987 controls (fdaFit$pred$obs unsuccessful) < 570 cases (fdaFit$pred$obs successful).
Area under the curve: 0.924
>
>
> ################################################################################
> ### Section 13.4 Support Vector Machines
>
> library(kernlab)
>
> set.seed(201)
> sigmaRangeFull <- sigest(as.matrix(training[,fullSet]))
> svmRGridFull <- expand.grid(sigma = as.vector(sigmaRangeFull)[1],
+ C = 2^(-3:4))
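`sigest()` supplies a data-driven range for the RBF kernel's `sigma`: it returns three estimates (roughly low, median, and high plausible values based on pairwise distances), and the code above keeps only the first. A toy illustration on random data, not the grant predictors:

```r
library(kernlab)

## sigest() returns a length-3 vector of plausible sigma values for
## the radial basis kernel, estimated from a sample of pairwise
## distances in x.
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)
srange <- sigest(x)
srange[1]   # the smallest estimate, as used for svmRGridFull above
```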
> set.seed(476)
> svmRFitFull <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "svmRadial",
+ metric = "ROC",
+ preProc = c("center", "scale"),
+ tuneGrid = svmRGridFull,
+ trControl = ctrl)
> svmRFitFull
Support Vector Machines with Radial Basis Function Kernel
8190 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
C ROC Sens Spec
0.125 0.781 0.916 0.521
0.25 0.851 0.861 0.694
0.5 0.866 0.84 0.755
1 0.873 0.83 0.774
2 0.875 0.821 0.791
4 0.875 0.811 0.803
8 0.87 0.798 0.799
16 0.866 0.798 0.81
Tuning parameter 'sigma' was held constant at a value of 0.0002385724
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.000239 and C = 2.
>
> set.seed(202)
> sigmaRangeReduced <- sigest(as.matrix(training[,reducedSet]))
> svmRGridReduced <- expand.grid(sigma = sigmaRangeReduced[1],
+ C = 2^(seq(-4, 4)))
> set.seed(476)
> svmRFitReduced <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "svmRadial",
+ metric = "ROC",
+ preProc = c("center", "scale"),
+ tuneGrid = svmRGridReduced,
+ trControl = ctrl)
> svmRFitReduced
Support Vector Machines with Radial Basis Function Kernel
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
C ROC Sens Spec
0.0625 0.866 0.916 0.691
0.125 0.88 0.86 0.758
0.25 0.89 0.849 0.781
0.5 0.894 0.83 0.8
1 0.895 0.811 0.815
2 0.891 0.805 0.83
4 0.887 0.805 0.822
8 0.885 0.798 0.821
16 0.882 0.8 0.82
Tuning parameter 'sigma' was held constant at a value of 0.001166986
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.00117 and C = 1.
>
> svmPGrid <- expand.grid(degree = 1:2,
+ scale = c(0.01, .005),
+ C = 2^(seq(-6, -2, length = 10)))
>
> set.seed(476)
> svmPFitFull <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "svmPoly",
+ metric = "ROC",
+ preProc = c("center", "scale"),
+ tuneGrid = svmPGrid,
+ trControl = ctrl)
> svmPFitFull
Support Vector Machines with Polynomial Kernel
8190 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
degree scale C ROC Sens Spec
1 0.005 0.0156 0.856 0.886 0.706
1 0.005 0.0213 0.861 0.87 0.733
1 0.005 0.0289 0.863 0.868 0.758
1 0.005 0.0394 0.867 0.863 0.768
1 0.005 0.0536 0.87 0.863 0.777
1 0.005 0.0729 0.872 0.856 0.782
1 0.005 0.0992 0.872 0.84 0.789
1 0.005 0.135 0.873 0.825 0.798
1 0.005 0.184 0.872 0.816 0.798
1 0.005 0.25 0.872 0.814 0.803
1 0.01 0.0156 0.864 0.868 0.758
1 0.01 0.0213 0.868 0.865 0.768
1 0.01 0.0289 0.87 0.861 0.78
1 0.01 0.0394 0.872 0.849 0.784
1 0.01 0.0536 0.873 0.84 0.79
1 0.01 0.0729 0.873 0.825 0.798
1 0.01 0.0992 0.872 0.814 0.801
1 0.01 0.135 0.871 0.812 0.802
1 0.01 0.184 0.868 0.812 0.795
1 0.01 0.25 0.862 0.791 0.795
2 0.005 0.0156 0.838 0.812 0.752
2 0.005 0.0213 0.845 0.816 0.766
2 0.005 0.0289 0.852 0.819 0.776
2 0.005 0.0394 0.856 0.819 0.78
2 0.005 0.0536 0.86 0.814 0.784
2 0.005 0.0729 0.865 0.825 0.782
2 0.005 0.0992 0.866 0.823 0.787
2 0.005 0.135 0.865 0.816 0.788
2 0.005 0.184 0.862 0.802 0.789
2 0.005 0.25 0.86 0.807 0.784
2 0.01 0.0156 0.845 0.816 0.765
2 0.01 0.0213 0.851 0.811 0.774
2 0.01 0.0289 0.856 0.811 0.778
2 0.01 0.0394 0.857 0.812 0.78
2 0.01 0.0536 0.855 0.809 0.779
2 0.01 0.0729 0.854 0.796 0.786
2 0.01 0.0992 0.854 0.789 0.783
2 0.01 0.135 0.852 0.788 0.78
2 0.01 0.184 0.851 0.782 0.778
2 0.01 0.25 0.85 0.784 0.78
ROC was used to select the optimal model using the largest value.
The final values used for the model were degree = 1, scale = 0.01 and C
= 0.0729.
>
> svmPGrid2 <- expand.grid(degree = 1:2,
+ scale = c(0.01, .005),
+ C = 2^(seq(-6, -2, length = 10)))
> set.seed(476)
> svmPFitReduced <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "svmPoly",
+ metric = "ROC",
+ preProc = c("center", "scale"),
+ tuneGrid = svmPGrid2,
+ fit = FALSE,
+ trControl = ctrl)
line search fails -2.047663 -0.1283902 1.181205e-05 2.17076e-06 -2.621876e-08 -4.051435e-09 -3.184921e-13
Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In train.default(x = training[, reducedSet], y = training$Class, :
missing values found in aggregated results
> svmPFitReduced
Support Vector Machines with Polynomial Kernel
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
degree scale C ROC Sens Spec
1 0.005 0.0156 0.867 0.94 0.653
1 0.005 0.0213 0.875 0.926 0.707
1 0.005 0.0289 0.881 0.912 0.738
1 0.005 0.0394 0.887 0.909 0.743
1 0.005 0.0536 0.892 0.904 0.762
1 0.005 0.0729 0.895 0.895 0.772
1 0.005 0.0992 0.897 0.863 0.781
1 0.005 0.135 0.896 0.854 0.797
1 0.005 0.184 0.896 0.849 0.804
1 0.005 0.25 0.895 0.844 0.811
1 0.01 0.0156 0.883 0.916 0.74
1 0.01 0.0213 0.888 0.911 0.749
1 0.01 0.0289 0.893 0.898 0.762
1 0.01 0.0394 0.896 0.886 0.775
1 0.01 0.0536 0.896 0.863 0.785
1 0.01 0.0729 0.897 0.853 0.8
1 0.01 0.0992 0.896 0.847 0.81
1 0.01 0.135 0.894 0.844 0.81
1 0.01 0.184 0.891 0.837 0.818
1 0.01 0.25 0.888 0.816 0.825
2 0.005 0.0156 0.88 0.902 0.746
2 0.005 0.0213 0.886 0.896 0.759
2 0.005 0.0289 0.89 0.879 0.774
2 0.005 0.0394 0.894 0.877 0.777
2 0.005 0.0536 0.896 0.854 0.794
2 0.005 0.0729 0.898 0.842 0.805
2 0.005 0.0992 0.898 0.83 0.815
2 0.005 0.135 0.896 0.828 0.828
2 0.005 0.184 0.896 0.819 0.828
2 0.005 0.25 0.893 0.818 0.828
2 0.01 0.0156 0.891 0.863 0.781
2 0.01 0.0213 0.894 0.856 0.788
2 0.01 0.0289 0.896 0.832 0.803
2 0.01 0.0394 0.897 0.826 0.81
2 0.01 0.0536 0.896 0.833 0.818
2 0.01 0.0729 0.893 0.819 0.821
2 0.01 0.0992 NaN NaN NaN
2 0.01 0.135 0.887 0.802 0.83
2 0.01 0.184 0.883 0.804 0.832
2 0.01 0.25 0.88 0.8 0.831
ROC was used to select the optimal model using the largest value.
The final values used for the model were degree = 2, scale = 0.005 and C
= 0.0729.
>
> svmPFitReduced$pred <- merge(svmPFitReduced$pred, svmPFitReduced$bestTune)
> svmPCM <- confusionMatrix(svmPFitReduced, norm = "none")
> svmPRoc <- roc(response = svmPFitReduced$pred$obs,
+ predictor = svmPFitReduced$pred$successful,
+ levels = rev(levels(svmPFitReduced$pred$obs)))
>
>
> svmRadialResults <- rbind(svmRFitReduced$results,
+ svmRFitFull$results)
> svmRadialResults$Set <- c(rep("Reduced Set", nrow(svmRFitReduced$results)),
+ rep("Full Set", nrow(svmRFitFull$results)))
> svmRadialResults$Sigma <- paste("sigma = ",
+ format(svmRadialResults$sigma,
+ scientific = FALSE, digits= 5))
> svmRadialResults <- svmRadialResults[!is.na(svmRadialResults$ROC),]
> xyplot(ROC ~ C|Set, data = svmRadialResults,
+ groups = Sigma, type = c("g", "o"),
+ xlab = "Cost",
+ ylab = "ROC (2008 Hold-Out Data)",
+ auto.key = list(columns = 2),
+ scales = list(x = list(log = 2)))
>
> svmPolyResults <- rbind(svmPFitReduced$results,
+ svmPFitFull$results)
> svmPolyResults$Set <- c(rep("Reduced Set", nrow(svmPFitReduced$results)),
+ rep("Full Set", nrow(svmPFitFull$results)))
> svmPolyResults <- svmPolyResults[!is.na(svmPolyResults$ROC),]
> svmPolyResults$scale <- paste("scale = ",
+ format(svmPolyResults$scale,
+ scientific = FALSE))
> svmPolyResults$Degree <- "Linear"
> svmPolyResults$Degree[svmPolyResults$degree == 2] <- "Quadratic"
> useOuterStrips(xyplot(ROC ~ C|Degree*Set, data = svmPolyResults,
+ groups = scale, type = c("g", "o"),
+ xlab = "Cost",
+ ylab = "ROC (2008 Hold-Out Data)",
+ auto.key = list(columns = 2),
+ scales = list(x = list(log = 2))))
>
> plot(nnetRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = nnetFit4$pred$obs, predictor = nnetFit4$pred$successful, levels = rev(levels(nnetFit4$pred$obs)))
Data: nnetFit4$pred$successful in 987 controls (nnetFit4$pred$obs unsuccessful) < 570 cases (nnetFit4$pred$obs successful).
Area under the curve: 0.9111
> plot(fdaRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = fdaFit$pred$obs, predictor = fdaFit$pred$successful, levels = rev(levels(fdaFit$pred$obs)))
Data: fdaFit$pred$successful in 987 controls (fdaFit$pred$obs unsuccessful) < 570 cases (fdaFit$pred$obs successful).
Area under the curve: 0.924
> plot(svmPRoc, type = "s", add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = svmPFitReduced$pred$obs, predictor = svmPFitReduced$pred$successful, levels = rev(levels(svmPFitReduced$pred$obs)))
Data: svmPFitReduced$pred$successful in 987 controls (svmPFitReduced$pred$obs unsuccessful) < 570 cases (svmPFitReduced$pred$obs successful).
Area under the curve: 0.8982
>
> ################################################################################
> ### Section 13.5 K-Nearest Neighbors
>
>
> set.seed(476)
> knnFit <- train(x = training[,reducedSet],
+ y = training$Class,
+ method = "knn",
+ metric = "ROC",
+ preProc = c("center", "scale"),
+ tuneGrid = data.frame(k = c(4*(0:5)+1,20*(1:5)+1,50*(2:9)+1)),
+ trControl = ctrl)
> knnFit
k-Nearest Neighbors
8190 samples
252 predictors
2 classes: 'successful', 'unsuccessful'
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
k ROC Sens Spec ROC SD Sens SD Spec SD
1 0.622 0.547 0.694 NA NA NA
5 0.7 0.553 0.708 NA NA NA
9 0.706 0.542 0.746 NA NA NA
13 0.709 0.558 0.743 NA NA NA
17 0.711 0.565 0.737 NA NA NA
21 0.724 0.542 0.739 0 0 0.00143
41 0.734 0.575 0.757 NA NA NA
61 0.75 0.556 0.785 NA NA NA
81 0.762 0.535 0.811 NA NA NA
101 0.766 0.52 0.825 0 0.00124 0
151 0.773 0.454 0.866 NA NA NA
201 0.779 0.395 0.891 NA NA NA
251 0.781 0.351 0.897 NA NA NA
301 0.787 0.333 0.907 NA NA NA
351 0.792 0.312 0.906 NA NA NA
401 0.797 0.337 0.905 NA NA NA
451 0.807 0.353 0.908 NA NA NA
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 451.
>
> knnFit$pred <- merge(knnFit$pred, knnFit$bestTune)
> knnCM <- confusionMatrix(knnFit, norm = "none")
> knnCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 201 91
unsuccessful 369 896
Accuracy : 0.7046
95% CI : (0.6812, 0.7271)
No Information Rate : 0.6339
P-Value [Acc > NIR] : 2.461e-09
Kappa : 0.2903
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.3526
Specificity : 0.9078
Pos Pred Value : 0.6884
Neg Pred Value : 0.7083
Prevalence : 0.3661
Detection Rate : 0.1291
Detection Prevalence : 0.1875
Balanced Accuracy : 0.6302
'Positive' Class : successful
> knnRoc <- roc(response = knnFit$pred$obs,
+ predictor = knnFit$pred$successful,
+ levels = rev(levels(knnFit$pred$obs)))
>
> update(plot(knnFit, ylab = "ROC (2008 Hold-Out Data)"))
>
> plot(fdaRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = fdaFit$pred$obs, predictor = fdaFit$pred$successful, levels = rev(levels(fdaFit$pred$obs)))
Data: fdaFit$pred$successful in 987 controls (fdaFit$pred$obs unsuccessful) < 570 cases (fdaFit$pred$obs successful).
Area under the curve: 0.924
> plot(nnetRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = nnetFit4$pred$obs, predictor = nnetFit4$pred$successful, levels = rev(levels(nnetFit4$pred$obs)))
Data: nnetFit4$pred$successful in 987 controls (nnetFit4$pred$obs unsuccessful) < 570 cases (nnetFit4$pred$obs successful).
Area under the curve: 0.9111
> plot(svmPRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = svmPFitReduced$pred$obs, predictor = svmPFitReduced$pred$successful, levels = rev(levels(svmPFitReduced$pred$obs)))
Data: svmPFitReduced$pred$successful in 987 controls (svmPFitReduced$pred$obs unsuccessful) < 570 cases (svmPFitReduced$pred$obs successful).
Area under the curve: 0.8982
> plot(knnRoc, type = "s", add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = knnFit$pred$obs, predictor = knnFit$pred$successful, levels = rev(levels(knnFit$pred$obs)))
Data: knnFit$pred$successful in 987 controls (knnFit$pred$obs unsuccessful) < 570 cases (knnFit$pred$obs successful).
Area under the curve: 0.8068
>
> ################################################################################
> ### Section 13.6 Naive Bayes
>
> ## Create factor versions of some of the predictors so that they are treated
> ## as categories and not dummy variables
>
> factors <- c("SponsorCode", "ContractValueBand", "Month", "Weekday")
> nbPredictors <- factorPredictors[factorPredictors %in% reducedSet]
> nbPredictors <- c(nbPredictors, factors)
> nbPredictors <- nbPredictors[nbPredictors != "SponsorUnk"]
>
> nbTraining <- training[, c("Class", nbPredictors)]
> nbTesting <- testing[, c("Class", nbPredictors)]
>
> for(i in nbPredictors)
+ {
+ if(length(unique(training[,i])) <= 15)
+ {
+ nbTraining[, i] <- factor(nbTraining[,i], levels = paste(sort(unique(training[,i]))))
+ nbTesting[, i] <- factor(nbTesting[,i], levels = paste(sort(unique(training[,i]))))
+ }
+ }
>
> set.seed(476)
> nBayesFit <- train(x = nbTraining[,nbPredictors],
+ y = nbTraining$Class,
+ method = "nb",
+ metric = "ROC",
+ tuneGrid = data.frame(usekernel = c(TRUE, FALSE), fL = 2),
+ trControl = ctrl)
Loading required package: klaR
Loading required package: MASS
> nBayesFit
Naive Bayes
8190 samples
205 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
usekernel ROC Sens Spec
FALSE 0.782 0.588 0.796
TRUE 0.814 0.644 0.824
Tuning parameter 'fL' was held constant at a value of 2
ROC was used to select the optimal model using the largest value.
The final values used for the model were fL = 2 and usekernel = TRUE.
>
> nBayesFit$pred <- merge(nBayesFit$pred, nBayesFit$bestTune)
> nBayesCM <- confusionMatrix(nBayesFit, norm = "none")
> nBayesCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 367 174
unsuccessful 203 813
Accuracy : 0.7579
95% CI : (0.7358, 0.779)
No Information Rate : 0.6339
P-Value [Acc > NIR] : <2e-16
Kappa : 0.4726
Mcnemar's Test P-Value : 0.1493
Sensitivity : 0.6439
Specificity : 0.8237
Pos Pred Value : 0.6784
Neg Pred Value : 0.8002
Prevalence : 0.3661
Detection Rate : 0.2357
Detection Prevalence : 0.3475
Balanced Accuracy : 0.7338
'Positive' Class : successful
> nBayesRoc <- roc(response = nBayesFit$pred$obs,
+ predictor = nBayesFit$pred$successful,
+ levels = rev(levels(nBayesFit$pred$obs)))
> nBayesRoc
Call:
roc.default(response = nBayesFit$pred$obs, predictor = nBayesFit$pred$successful, levels = rev(levels(nBayesFit$pred$obs)))
Data: nBayesFit$pred$successful in 987 controls (nBayesFit$pred$obs unsuccessful) < 570 cases (nBayesFit$pred$obs successful).
Area under the curve: 0.8137
>
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] klaR_0.6-7 MASS_7.3-26 kernlab_0.9-16
[4] earth_3.2-3 plotrix_3.4-6 plotmo_1.3-2
[7] leaps_2.9 latticeExtra_0.6-24 RColorBrewer_1.0-5
[10] nnet_7.3-6 e1071_1.6-1 pROC_1.5.4
[13] plyr_1.8 mda_0.4-2 class_7.3-7
[16] doMC_1.3.0 iterators_1.0.6 foreach_1.4.0
[19] caret_6.0-22 ggplot2_0.9.3.1 lattice_0.20-15
loaded via a namespace (and not attached):
[1] car_2.0-16 codetools_0.2-8 colorspace_1.2-1 compiler_3.0.1
[5] dichromat_2.0-0 digest_0.6.3 grid_3.0.1 gtable_0.1.2
[9] labeling_0.1 munsell_0.4 proto_0.3-10 reshape2_1.2.2
[13] scales_0.2.3 stringr_0.6.2
>
> q("no")
> proc.time()
user system elapsed
313451.24 2270.67 52861.72
%%R -w 600 -h 600
## runChapterScript(13)
## user system elapsed
## 313451.24 2270.67 52861.72
NULL
%%R
showChapterScript(14)
NULL
%%R
showChapterOutput(14)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 14 Classification Trees and Rule Based Models
> ###
> ### Required packages: AppliedPredictiveModeling, C50, caret, doMC (optional),
> ### gbm, lattice, partykit, pROC, randomForest, reshape2,
> ### rpart, RWeka
> ###
> ### Data used: The grant application data. See the file 'CreateGrantData.R'
> ###
> ### Notes:
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing sections. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be
> ### syntax differences that occur over time as packages evolve. These files
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
>
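Note 3 above mentions that some of the book's calculations were run in parallel. A minimal sketch of how that can be enabled with the doMC backend (listed among the optional packages for this chapter); the worker count here is an assumption, and the seed should be reset immediately before each `train()` call since parallel sub-processes can reset the random number stream:

```r
## Sketch only: register a parallel backend so train() fits resamples
## in parallel. The core count (4) is an assumption, not from the script.
library(doMC)
registerDoMC(cores = 4)

## Re-seed right before each train() call; parallel workers may
## otherwise reset the random number seed (see note 3 above).
set.seed(476)
```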
> ### NOTE: Many of the models here are computationally expensive. If
> ### this script is run as-is, the memory requirements will accumulate
> ### until it exceeds 32 GB.
>
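One way to keep the memory footprint bounded when running the script as-is is to discard each fitted model once its summaries have been extracted. A sketch with a hypothetical object name (`someFit` stands in for any `train` object no longer needed):

```r
## Sketch only: free a large train object after its resampling results,
## confusion matrix, and ROC curve have been pulled out.
someRes <- someFit$results   # keep the lightweight summary
rm(someFit)                  # drop the large fitted object
invisible(gc())              # reclaim the memory before the next fit
```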
> ################################################################################
> ### Section 14.1 Basic Classification Trees
>
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
>
> load("grantData.RData")
>
> ctrl <- trainControl(method = "LGOCV",
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE,
+ index = list(TrainSet = pre2008),
+ savePredictions = TRUE)
>
> set.seed(476)
> rpartFit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "rpart",
+ tuneLength = 30,
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: rpart
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following object is masked from 'package:stats':
cov, smooth, var
> rpartFit
CART
8190 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
cp ROC Sens Spec
0.000351 0.895 0.779 0.837
0.000394 0.895 0.779 0.837
0.000526 0.896 0.804 0.841
0.000657 0.897 0.823 0.83
0.000789 0.897 0.793 0.839
0.000877 0.897 0.877 0.818
0.000894 0.897 0.877 0.818
0.00092 0.897 0.877 0.818
0.00105 0.898 0.881 0.806
0.00131 0.906 0.882 0.816
0.00145 0.91 0.844 0.848
0.00158 0.911 0.847 0.846
0.0021 0.912 0.811 0.862
0.00224 0.912 0.811 0.862
0.00237 0.912 0.811 0.862
0.00272 0.912 0.811 0.862
0.00276 0.912 0.811 0.862
0.0028 0.912 0.8 0.865
0.00289 0.912 0.8 0.865
0.00394 0.883 0.886 0.811
0.00421 0.875 0.858 0.81
0.0046 0.875 0.858 0.81
0.00526 0.874 0.858 0.81
0.00736 0.884 0.837 0.813
0.0113 0.884 0.837 0.813
0.021 0.871 0.947 0.727
0.0227 0.871 0.947 0.727
0.0465 0.85 0.944 0.735
0.0715 0.852 0.944 0.738
0.387 0.815 0.991 0.638
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.00289.
>
> library(partykit)
Loading required package: grid
> plot(as.party(rpartFit$finalModel))
>
> rpart2008 <- merge(rpartFit$pred, rpartFit$bestTune)
> rpartCM <- confusionMatrix(rpartFit, norm = "none")
> rpartCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Loading required package: class
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 456 133
unsuccessful 114 854
Accuracy : 0.8414
95% CI : (0.8223, 0.8592)
No Information Rate : 0.6339
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6606
Mcnemar's Test P-Value : 0.2521
Sensitivity : 0.8000
Specificity : 0.8652
Pos Pred Value : 0.7742
Neg Pred Value : 0.8822
Prevalence : 0.3661
Detection Rate : 0.2929
Detection Prevalence : 0.3783
Balanced Accuracy : 0.8326
'Positive' Class : successful
> rpartRoc <- roc(response = rpartFit$pred$obs,
+ predictor = rpartFit$pred$successful,
+ levels = rev(levels(rpartFit$pred$obs)))
>
> set.seed(476)
> rpartFactorFit <- train(x = training[,factorPredictors],
+ y = training$Class,
+ method = "rpart",
+ tuneLength = 30,
+ metric = "ROC",
+ trControl = ctrl)
> rpartFactorFit
CART
8190 samples
1488 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
cp ROC Sens Spec
0.000175 0.901 0.735 0.87
0.00021 0.901 0.735 0.87
0.000263 0.901 0.735 0.87
0.000368 0.891 0.761 0.864
0.000376 0.891 0.761 0.864
0.000394 0.891 0.761 0.864
0.000526 0.891 0.775 0.865
0.000657 0.895 0.795 0.866
0.000789 0.899 0.821 0.864
0.000877 0.899 0.821 0.864
0.00092 0.899 0.821 0.864
0.00105 0.897 0.825 0.856
0.00118 0.898 0.825 0.853
0.00131 0.894 0.837 0.847
0.00145 0.894 0.837 0.847
0.00184 0.902 0.825 0.855
0.00237 0.902 0.825 0.858
0.0025 0.903 0.821 0.866
0.00263 0.903 0.821 0.866
0.00289 0.91 0.812 0.872
0.00394 0.892 0.847 0.831
0.00539 0.892 0.847 0.831
0.0071 0.892 0.847 0.831
0.00763 0.901 0.847 0.831
0.0116 0.899 0.828 0.834
0.0146 0.899 0.828 0.834
0.0318 0.9 0.823 0.841
0.0652 0.867 0.865 0.779
0.153 0.817 0.988 0.645
0.393 0.817 0.988 0.645
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.00289.
> plot(as.party(rpartFactorFit$finalModel))
>
> rpartFactor2008 <- merge(rpartFactorFit$pred, rpartFactorFit$bestTune)
> rpartFactorCM <- confusionMatrix(rpartFactorFit, norm = "none")
> rpartFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 463 126
unsuccessful 107 861
Accuracy : 0.8504
95% CI : (0.8317, 0.8677)
No Information Rate : 0.6339
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6798
Mcnemar's Test P-Value : 0.2383
Sensitivity : 0.8123
Specificity : 0.8723
Pos Pred Value : 0.7861
Neg Pred Value : 0.8895
Prevalence : 0.3661
Detection Rate : 0.2974
Detection Prevalence : 0.3783
Balanced Accuracy : 0.8423
'Positive' Class : successful
>
> rpartFactorRoc <- roc(response = rpartFactorFit$pred$obs,
+ predictor = rpartFactorFit$pred$successful,
+ levels = rev(levels(rpartFactorFit$pred$obs)))
>
> plot(rpartRoc, type = "s", print.thres = c(.5),
+ print.thres.pch = 3,
+ print.thres.pattern = "",
+ print.thres.cex = 1.2,
+ col = "red", legacy.axes = TRUE,
+ print.thres.col = "red")
Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful, levels = rev(levels(rpartFit$pred$obs)))
Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(rpartFactorRoc,
+ type = "s",
+ add = TRUE,
+ print.thres = c(.5),
+ print.thres.pch = 16, legacy.axes = TRUE,
+ print.thres.pattern = "",
+ print.thres.cex = 1.2)
Call:
roc.default(response = rpartFactorFit$pred$obs, predictor = rpartFactorFit$pred$successful, levels = rev(levels(rpartFactorFit$pred$obs)))
Data: rpartFactorFit$pred$successful in 29610 controls (rpartFactorFit$pred$obs unsuccessful) < 17100 cases (rpartFactorFit$pred$obs successful).
Area under the curve: 0.8856
> legend(.75, .2,
+ c("Grouped Categories", "Independent Categories"),
+ lwd = c(1, 1),
+ col = c("black", "red"),
+ pch = c(16, 3))
>
> set.seed(476)
> j48FactorFit <- train(x = training[,factorPredictors],
+ y = training$Class,
+ method = "J48",
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: RWeka
> j48FactorFit
C4.5-like Trees
8190 samples
1488 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results
ROC Sens Spec
0.835 0.839 0.817
Tuning parameter 'C' was held constant at a value of 0.25
>
> j48Factor2008 <- merge(j48FactorFit$pred, j48FactorFit$bestTune)
> j48FactorCM <- confusionMatrix(j48FactorFit, norm = "none")
> j48FactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 478 181
unsuccessful 92 806
Accuracy : 0.8247
95% CI : (0.8048, 0.8432)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6343
Mcnemar's Test P-Value : 1.004e-07
Sensitivity : 0.8386
Specificity : 0.8166
Pos Pred Value : 0.7253
Neg Pred Value : 0.8976
Prevalence : 0.3661
Detection Rate : 0.3070
Detection Prevalence : 0.4232
Balanced Accuracy : 0.8276
'Positive' Class : successful
>
> j48FactorRoc <- roc(response = j48FactorFit$pred$obs,
+ predictor = j48FactorFit$pred$successful,
+ levels = rev(levels(j48FactorFit$pred$obs)))
>
> set.seed(476)
> j48Fit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "J48",
+ metric = "ROC",
+ trControl = ctrl)
>
> j482008 <- merge(j48Fit$pred, j48Fit$bestTune)
> j48CM <- confusionMatrix(j48Fit, norm = "none")
> j48CM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 438 160
unsuccessful 132 827
Accuracy : 0.8125
95% CI : (0.7922, 0.8316)
No Information Rate : 0.6339
P-Value [Acc > NIR] : <2e-16
Kappa : 0.6001
Mcnemar's Test P-Value : 0.1141
Sensitivity : 0.7684
Specificity : 0.8379
Pos Pred Value : 0.7324
Neg Pred Value : 0.8624
Prevalence : 0.3661
Detection Rate : 0.2813
Detection Prevalence : 0.3841
Balanced Accuracy : 0.8032
'Positive' Class : successful
>
> j48Roc <- roc(response = j48Fit$pred$obs,
+ predictor = j48Fit$pred$successful,
+ levels = rev(levels(j48Fit$pred$obs)))
>
>
> plot(j48FactorRoc, type = "s", print.thres = c(.5),
+ print.thres.pch = 16, print.thres.pattern = "",
+ print.thres.cex = 1.2, legacy.axes = TRUE)
Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful, levels = rev(levels(j48FactorFit$pred$obs)))
Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(j48Roc, type = "s", print.thres = c(.5),
+ print.thres.pch = 3, print.thres.pattern = "",
+ print.thres.cex = 1.2, legacy.axes = TRUE,
+ add = TRUE, col = "red", print.thres.col = "red")
Call:
roc.default(response = j48Fit$pred$obs, predictor = j48Fit$pred$successful, levels = rev(levels(j48Fit$pred$obs)))
Data: j48Fit$pred$successful in 987 controls (j48Fit$pred$obs unsuccessful) < 570 cases (j48Fit$pred$obs successful).
Area under the curve: 0.842
> legend(.75, .2,
+ c("Grouped Categories", "Independent Categories"),
+ lwd = c(1, 1),
+ col = c("black", "red"),
+ pch = c(16, 3))
>
> plot(rpartFactorRoc, type = "s", add = TRUE,
+ col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = rpartFactorFit$pred$obs, predictor = rpartFactorFit$pred$successful, levels = rev(levels(rpartFactorFit$pred$obs)))
Data: rpartFactorFit$pred$successful in 29610 controls (rpartFactorFit$pred$obs unsuccessful) < 17100 cases (rpartFactorFit$pred$obs successful).
Area under the curve: 0.8856
>
> ################################################################################
> ### Section 14.2 Rule-Based Models
>
> set.seed(476)
> partFit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "PART",
+ metric = "ROC",
+ trControl = ctrl)
> partFit
Rule-Based Classifier
8190 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results
ROC Sens Spec
0.809 0.779 0.802
Tuning parameter 'threshold' was held constant at a value of 0.25
Tuning parameter 'pruned' was held constant at a value of yes
>
> part2008 <- merge(partFit$pred, partFit$bestTune)
> partCM <- confusionMatrix(partFit, norm = "none")
> partCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 444 195
unsuccessful 126 792
Accuracy : 0.7938
95% CI : (0.7729, 0.8137)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5669
Mcnemar's Test P-Value : 0.0001474
Sensitivity : 0.7789
Specificity : 0.8024
Pos Pred Value : 0.6948
Neg Pred Value : 0.8627
Prevalence : 0.3661
Detection Rate : 0.2852
Detection Prevalence : 0.4104
Balanced Accuracy : 0.7907
'Positive' Class : successful
>
> partRoc <- roc(response = partFit$pred$obs,
+ predictor = partFit$pred$successful,
+ levels = rev(levels(partFit$pred$obs)))
> partRoc
Call:
roc.default(response = partFit$pred$obs, predictor = partFit$pred$successful, levels = rev(levels(partFit$pred$obs)))
Data: partFit$pred$successful in 987 controls (partFit$pred$obs unsuccessful) < 570 cases (partFit$pred$obs successful).
Area under the curve: 0.809
>
> set.seed(476)
> partFactorFit <- train(training[,factorPredictors], training$Class,
+ method = "PART",
+ metric = "ROC",
+ trControl = ctrl)
> partFactorFit
Rule-Based Classifier
8190 samples
1488 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results
ROC Sens Spec
0.835 0.807 0.766
Tuning parameter 'threshold' was held constant at a value of 0.25
Tuning parameter 'pruned' was held constant at a value of yes
>
> partFactor2008 <- merge(partFactorFit$pred, partFactorFit$bestTune)
> partFactorCM <- confusionMatrix(partFactorFit, norm = "none")
> partFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 460 231
unsuccessful 110 756
Accuracy : 0.781
95% CI : (0.7596, 0.8013)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5484
Mcnemar's Test P-Value : 8.12e-11
Sensitivity : 0.8070
Specificity : 0.7660
Pos Pred Value : 0.6657
Neg Pred Value : 0.8730
Prevalence : 0.3661
Detection Rate : 0.2954
Detection Prevalence : 0.4438
Balanced Accuracy : 0.7865
'Positive' Class : successful
>
> partFactorRoc <- roc(response = partFactorFit$pred$obs,
+ predictor = partFactorFit$pred$successful,
+ levels = rev(levels(partFactorFit$pred$obs)))
> partFactorRoc
Call:
roc.default(response = partFactorFit$pred$obs, predictor = partFactorFit$pred$successful, levels = rev(levels(partFactorFit$pred$obs)))
Data: partFactorFit$pred$successful in 987 controls (partFactorFit$pred$obs unsuccessful) < 570 cases (partFactorFit$pred$obs successful).
Area under the curve: 0.8347
>
> ################################################################################
> ### Section 14.3 Bagged Trees
>
> set.seed(476)
> treebagFit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "treebag",
+ nbagg = 50,
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: ipred
Loading required package: MASS
Loading required package: survival
Loading required package: splines
Attaching package: 'survival'
The following object is masked from 'package:caret':
cluster
Loading required package: nnet
Loading required package: prodlim
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> treebagFit
Bagged CART
8190 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results
ROC Sens Spec
0.921 0.83 0.857
>
> treebag2008 <- merge(treebagFit$pred, treebagFit$bestTune)
> treebagCM <- confusionMatrix(treebagFit, norm = "none")
> treebagCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 473 141
unsuccessful 97 846
Accuracy : 0.8471
95% CI : (0.8283, 0.8647)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6759
Mcnemar's Test P-Value : 0.005315
Sensitivity : 0.8298
Specificity : 0.8571
Pos Pred Value : 0.7704
Neg Pred Value : 0.8971
Prevalence : 0.3661
Detection Rate : 0.3038
Detection Prevalence : 0.3943
Balanced Accuracy : 0.8435
'Positive' Class : successful
>
> treebagRoc <- roc(response = treebagFit$pred$obs,
+ predictor = treebagFit$pred$successful,
+ levels = rev(levels(treebagFit$pred$obs)))
> set.seed(476)
> treebagFactorFit <- train(x = training[,factorPredictors],
+ y = training$Class,
+ method = "treebag",
+ nbagg = 50,
+ metric = "ROC",
+ trControl = ctrl)
> treebagFactorFit
Bagged CART
8190 samples
1488 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results
ROC Sens Spec
0.917 0.835 0.861
>
> treebagFactor2008 <- merge(treebagFactorFit$pred, treebagFactorFit$bestTune)
> treebagFactorCM <- confusionMatrix(treebagFactorFit, norm = "none")
> treebagFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 476 137
unsuccessful 94 850
Accuracy : 0.8516
95% CI : (0.833, 0.8689)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6854
Mcnemar's Test P-Value : 0.00572
Sensitivity : 0.8351
Specificity : 0.8612
Pos Pred Value : 0.7765
Neg Pred Value : 0.9004
Prevalence : 0.3661
Detection Rate : 0.3057
Detection Prevalence : 0.3937
Balanced Accuracy : 0.8481
'Positive' Class : successful
> treebagFactorRoc <- roc(response = treebagFactorFit$pred$obs,
+ predictor = treebagFactorFit$pred$successful,
+ levels = rev(levels(treebagFactorFit$pred$obs)))
>
>
> plot(rpartRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful, levels = rev(levels(rpartFit$pred$obs)))
Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(j48FactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2),
+ legacy.axes = TRUE)
Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful, levels = rev(levels(j48FactorFit$pred$obs)))
Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(treebagRoc, type = "s", add = TRUE, print.thres = c(.5),
+ print.thres.pch = 3, legacy.axes = TRUE, print.thres.pattern = "",
+ print.thres.cex = 1.2,
+ col = "red", print.thres.col = "red")
Call:
roc.default(response = treebagFit$pred$obs, predictor = treebagFit$pred$successful, levels = rev(levels(treebagFit$pred$obs)))
Data: treebagFit$pred$successful in 987 controls (treebagFit$pred$obs unsuccessful) < 570 cases (treebagFit$pred$obs successful).
Area under the curve: 0.9205
> plot(treebagFactorRoc, type = "s", add = TRUE, print.thres = c(.5),
+ print.thres.pch = 16, print.thres.pattern = "", legacy.axes = TRUE,
+ print.thres.cex = 1.2)
Call:
roc.default(response = treebagFactorFit$pred$obs, predictor = treebagFactorFit$pred$successful, levels = rev(levels(treebagFactorFit$pred$obs)))
Data: treebagFactorFit$pred$successful in 987 controls (treebagFactorFit$pred$obs unsuccessful) < 570 cases (treebagFactorFit$pred$obs successful).
Area under the curve: 0.9173
> legend(.75, .2,
+ c("Grouped Categories", "Independent Categories"),
+ lwd = c(1, 1),
+ col = c("black", "red"),
+ pch = c(16, 3))
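The AUC values that pROC prints above have a useful rank interpretation: the AUC is the probability that a randomly chosen positive case ('successful') receives a higher predicted probability than a randomly chosen negative case, with ties counting one half. A minimal Python sketch of that calculation on toy scores (not the grant data):

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a
    random negative; ties contribute 1/2 (Mann-Whitney view)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))

# Toy predicted probabilities: three positives, three negatives.
# 8 of the 9 positive/negative pairs are ordered correctly.
print(round(auc([0.9, 0.8, 0.55], [0.6, 0.4, 0.3]), 4))
```

This pairwise form is quadratic in the number of cases; pROC uses an equivalent trapezoidal computation over the ROC curve, which gives the same number.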
>
> ################################################################################
> ### Section 14.4 Random Forests
>
> ### For the book, this model was run with only 500 trees (by
> ### accident). More than 1000 trees are usually required to get
> ### consistent results.
>
> mtryValues <- c(5, 10, 20, 32, 50, 100, 250, 500, 1000)
> set.seed(476)
> rfFit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "rf",
+ ntree = 500,
+ tuneGrid = data.frame(mtry = mtryValues),
+ importance = TRUE,
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> rfFit
Random Forest
8190 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
mtry ROC Sens Spec
5 0.876 0.805 0.769
10 0.901 0.828 0.812
20 0.924 0.861 0.827
32 0.931 0.879 0.835
50 0.936 0.877 0.835
100 0.939 0.867 0.846
250 0.937 0.856 0.858
500 0.93 0.844 0.862
1000 0.923 0.837 0.853
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 100.
>
> rf2008 <- merge(rfFit$pred, rfFit$bestTune)
> rfCM <- confusionMatrix(rfFit, norm = "none")
> rfCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 494 152
unsuccessful 76 835
Accuracy : 0.8536
95% CI : (0.835, 0.8708)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6931
Mcnemar's Test P-Value : 6.8e-07
Sensitivity : 0.8667
Specificity : 0.8460
Pos Pred Value : 0.7647
Neg Pred Value : 0.9166
Prevalence : 0.3661
Detection Rate : 0.3173
Detection Prevalence : 0.4149
Balanced Accuracy : 0.8563
'Positive' Class : successful
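Every statistic in the summary above follows directly from the four cell counts of the 2x2 table. As a quick cross-check (in Python, outside the R session), using the counts printed for the random forest model:

```python
# Cell counts from the confusion matrix above (positive class: successful).
tp, fp = 494, 152   # predicted successful
fn, tn = 76, 835    # predicted unsuccessful
n = tp + fp + fn + tn

sensitivity = tp / (tp + fn)                    # 0.8667
specificity = tn / (tn + fp)                    # 0.8460
accuracy    = (tp + tn) / n                     # 0.8536
balanced    = (sensitivity + specificity) / 2   # 0.8563
prevalence  = (tp + fn) / n                     # 0.3661
print(round(sensitivity, 4), round(specificity, 4),
      round(accuracy, 4), round(balanced, 4))
```

Note that balanced accuracy (0.8563) exceeds raw accuracy (0.8536) here because the classes are imbalanced (prevalence 0.3661) and the model is slightly better on the majority class.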
>
> rfRoc <- roc(response = rfFit$pred$obs,
+ predictor = rfFit$pred$successful,
+ levels = rev(levels(rfFit$pred$obs)))
>
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 8050579 430.0 13156139 702.7 13156139 702.7
Vcells 4127289672 31488.8 6062765953 46255.3 5498501682 41950.3
>
> ## The randomForest package cannot handle factors with more than 32
> ## levels, so we make a new set of predictors where the sponsor code
> ## factor is entered as dummy variables instead of a single factor.
>
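Dummy coding, as used here for the sponsor code, replaces one categorical column with a set of 0/1 indicator columns, one per level; this sidesteps randomForest's 32-level limit at the cost of widening the predictor matrix. A small Python sketch of the idea (the codes below are made up, not actual sponsor codes):

```python
def one_hot(values):
    """Expand a categorical vector into 0/1 indicator columns,
    one per distinct level, in sorted level order."""
    levels = sorted(set(values))
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

codes = ["62B", "21A", "62B", "4D"]   # hypothetical sponsor codes
print(one_hot(codes))                  # 3 levels -> 3 indicator columns
```

In the R script this expansion was done up front (the `Sponsor*` dummy columns already exist in `training`), so the code below just swaps the single factor out of the predictor list and swaps the dummies in.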
> sponsorVars <- grep("Sponsor", names(training), value = TRUE)
> sponsorVars <- sponsorVars[sponsorVars != "SponsorCode"]
>
> rfPredictors <- factorPredictors
> rfPredictors <- rfPredictors[rfPredictors != "SponsorCode"]
> rfPredictors <- c(rfPredictors, sponsorVars)
>
> set.seed(476)
> rfFactorFit <- train(x = training[,rfPredictors],
+ y = training$Class,
+ method = "rf",
+ ntree = 1500,
+ tuneGrid = data.frame(mtry = mtryValues),
+ importance = TRUE,
+ metric = "ROC",
+ trControl = ctrl)
> rfFactorFit
Random Forest
8190 samples
1733 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
mtry ROC Sens Spec
5 0.808 0.619 0.817
10 0.855 0.726 0.815
20 0.891 0.754 0.84
32 0.911 0.774 0.855
50 0.921 0.802 0.865
100 0.93 0.823 0.87
250 0.937 0.842 0.871
500 0.936 0.847 0.876
1000 0.931 0.837 0.872
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 250.
>
> rfFactor2008 <- merge(rfFactorFit$pred, rfFactorFit$bestTune)
> rfFactorCM <- confusionMatrix(rfFactorFit, norm = "none")
> rfFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 480 127
unsuccessful 90 860
Accuracy : 0.8606
95% CI : (0.8424, 0.8775)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7038
Mcnemar's Test P-Value : 0.01453
Sensitivity : 0.8421
Specificity : 0.8713
Pos Pred Value : 0.7908
Neg Pred Value : 0.9053
Prevalence : 0.3661
Detection Rate : 0.3083
Detection Prevalence : 0.3899
Balanced Accuracy : 0.8567
'Positive' Class : successful
>
> rfFactorRoc <- roc(response = rfFactorFit$pred$obs,
+ predictor = rfFactorFit$pred$successful,
+ levels = rev(levels(rfFactorFit$pred$obs)))
>
> plot(treebagRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = treebagFit$pred$obs, predictor = treebagFit$pred$successful, levels = rev(levels(treebagFit$pred$obs)))
Data: treebagFit$pred$successful in 987 controls (treebagFit$pred$obs unsuccessful) < 570 cases (treebagFit$pred$obs successful).
Area under the curve: 0.9205
> plot(rpartRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful, levels = rev(levels(rpartFit$pred$obs)))
Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(j48FactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2),
+ legacy.axes = TRUE)
Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful, levels = rev(levels(j48FactorFit$pred$obs)))
Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(rfRoc, type = "s", add = TRUE, print.thres = c(.5),
+ print.thres.pch = 3, legacy.axes = TRUE, print.thres.pattern = "",
+ print.thres.cex = 1.2,
+ col = "red", print.thres.col = "red")
Call:
roc.default(response = rfFit$pred$obs, predictor = rfFit$pred$successful, levels = rev(levels(rfFit$pred$obs)))
Data: rfFit$pred$successful in 8883 controls (rfFit$pred$obs unsuccessful) < 5130 cases (rfFit$pred$obs successful).
Area under the curve: 0.9179
> plot(rfFactorRoc, type = "s", add = TRUE, print.thres = c(.5),
+ print.thres.pch = 16, print.thres.pattern = "", legacy.axes = TRUE,
+ print.thres.cex = 1.2)
Call:
roc.default(response = rfFactorFit$pred$obs, predictor = rfFactorFit$pred$successful, levels = rev(levels(rfFactorFit$pred$obs)))
Data: rfFactorFit$pred$successful in 8883 controls (rfFactorFit$pred$obs unsuccessful) < 5130 cases (rfFactorFit$pred$obs successful).
Area under the curve: 0.9049
> legend(.75, .2,
+ c("Grouped Categories", "Independent Categories"),
+ lwd = c(1, 1),
+ col = c("black", "red"),
+ pch = c(16, 3))
>
>
> ################################################################################
> ### Section 14.5 Boosting
>
> gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5, 7, 9),
+ n.trees = (1:20)*100,
+ shrinkage = c(.01, .1))
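`expand.grid` builds the full cross-product of the tuning values, so the search above covers 5 depths x 20 tree counts x 2 shrinkage values = 200 candidate gbm models, each evaluated on the single 2008 holdout split. The same cross-product in Python:

```python
import itertools

# Mirrors the gbmGrid above: every combination of the three tuning values.
interaction_depth = [1, 3, 5, 7, 9]
n_trees = [100 * i for i in range(1, 21)]   # 100, 200, ..., 2000
shrinkage = [0.01, 0.1]

grid = list(itertools.product(interaction_depth, n_trees, shrinkage))
print(len(grid))   # 200 tuning combinations
```

This is why the resampling table that follows is 200 rows long: one ROC/Sens/Spec row per grid point.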
>
> set.seed(476)
> gbmFit <- train(x = training[,fullSet],
+ y = training$Class,
+ method = "gbm",
+ tuneGrid = gbmGrid,
+ metric = "ROC",
+ verbose = FALSE,
+ trControl = ctrl)
Loading required package: gbm
Loading required package: parallel
Loaded gbm 2.1
> gbmFit
Stochastic Gradient Boosting
8190 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
shrinkage interaction.depth n.trees ROC Sens Spec
0.01 1 100 0.879 0.947 0.73
0.01 1 200 0.887 0.947 0.73
0.01 1 300 0.888 0.951 0.73
0.01 1 400 0.911 0.951 0.73
0.01 1 500 0.906 0.951 0.73
0.01 1 600 0.908 0.904 0.8
0.01 1 700 0.907 0.905 0.799
0.01 1 800 0.91 0.905 0.798
0.01 1 900 0.911 0.904 0.799
0.01 1 1000 0.914 0.904 0.799
0.01 1 1100 0.914 0.9 0.807
0.01 1 1200 0.915 0.898 0.814
0.01 1 1300 0.916 0.893 0.816
0.01 1 1400 0.917 0.889 0.821
0.01 1 1500 0.918 0.875 0.822
0.01 1 1600 0.919 0.877 0.826
0.01 1 1700 0.919 0.865 0.831
0.01 1 1800 0.919 0.839 0.842
0.01 1 1900 0.92 0.858 0.841
0.01 1 2000 0.92 0.835 0.847
0.01 3 100 0.913 0.947 0.729
0.01 3 200 0.918 0.889 0.81
0.01 3 300 0.919 0.889 0.809
0.01 3 400 0.922 0.889 0.809
0.01 3 500 0.924 0.889 0.818
0.01 3 600 0.926 0.882 0.829
0.01 3 700 0.926 0.868 0.84
0.01 3 800 0.928 0.875 0.845
0.01 3 900 0.928 0.868 0.847
0.01 3 1000 0.929 0.867 0.851
0.01 3 1100 0.93 0.865 0.854
0.01 3 1200 0.931 0.863 0.86
0.01 3 1300 0.931 0.86 0.863
0.01 3 1400 0.932 0.863 0.863
0.01 3 1500 0.932 0.86 0.865
0.01 3 1600 0.932 0.856 0.869
0.01 3 1700 0.932 0.853 0.865
0.01 3 1800 0.933 0.853 0.864
0.01 3 1900 0.933 0.851 0.865
0.01 3 2000 0.933 0.851 0.867
0.01 5 100 0.914 0.947 0.726
0.01 5 200 0.917 0.895 0.802
0.01 5 300 0.925 0.904 0.804
0.01 5 400 0.928 0.895 0.83
0.01 5 500 0.93 0.872 0.839
0.01 5 600 0.932 0.87 0.845
0.01 5 700 0.932 0.87 0.851
0.01 5 800 0.934 0.867 0.855
0.01 5 900 0.934 0.865 0.854
0.01 5 1000 0.935 0.86 0.861
0.01 5 1100 0.935 0.861 0.86
0.01 5 1200 0.935 0.861 0.861
0.01 5 1300 0.935 0.86 0.865
0.01 5 1400 0.935 0.854 0.865
0.01 5 1500 0.935 0.856 0.868
0.01 5 1600 0.935 0.854 0.868
0.01 5 1700 0.935 0.849 0.872
0.01 5 1800 0.935 0.844 0.873
0.01 5 1900 0.934 0.846 0.873
0.01 5 2000 0.935 0.837 0.875
0.01 7 100 0.913 0.893 0.798
0.01 7 200 0.92 0.911 0.802
0.01 7 300 0.926 0.898 0.828
0.01 7 400 0.931 0.87 0.842
0.01 7 500 0.932 0.867 0.849
0.01 7 600 0.933 0.865 0.854
0.01 7 700 0.934 0.863 0.858
0.01 7 800 0.934 0.858 0.861
0.01 7 900 0.935 0.853 0.863
0.01 7 1000 0.935 0.849 0.865
0.01 7 1100 0.935 0.847 0.864
0.01 7 1200 0.935 0.84 0.867
0.01 7 1300 0.935 0.839 0.872
0.01 7 1400 0.935 0.837 0.875
0.01 7 1500 0.935 0.83 0.874
0.01 7 1600 0.935 0.83 0.875
0.01 7 1700 0.935 0.832 0.878
0.01 7 1800 0.935 0.826 0.878
0.01 7 1900 0.935 0.819 0.876
0.01 7 2000 0.935 0.825 0.876
0.01 9 100 0.919 0.895 0.796
0.01 9 200 0.927 0.902 0.818
0.01 9 300 0.93 0.872 0.844
0.01 9 400 0.933 0.863 0.854
0.01 9 500 0.935 0.86 0.859
0.01 9 600 0.935 0.863 0.861
0.01 9 700 0.936 0.858 0.865
0.01 9 800 0.936 0.851 0.866
0.01 9 900 0.936 0.846 0.87
0.01 9 1000 0.936 0.849 0.869
0.01 9 1100 0.936 0.846 0.87
0.01 9 1200 0.936 0.846 0.873
0.01 9 1300 0.936 0.842 0.875
0.01 9 1400 0.936 0.842 0.876
0.01 9 1500 0.936 0.837 0.878
0.01 9 1600 0.935 0.84 0.879
0.01 9 1700 0.935 0.835 0.877
0.01 9 1800 0.935 0.837 0.879
0.01 9 1900 0.935 0.832 0.878
0.01 9 2000 0.935 0.823 0.877
0.1 1 100 0.914 0.889 0.813
0.1 1 200 0.92 0.805 0.864
0.1 1 300 0.921 0.828 0.859
0.1 1 400 0.923 0.821 0.86
0.1 1 500 0.922 0.816 0.865
0.1 1 600 0.923 0.809 0.869
0.1 1 700 0.922 0.819 0.87
0.1 1 800 0.922 0.818 0.869
0.1 1 900 0.922 0.819 0.871
0.1 1 1000 0.921 0.823 0.869
0.1 1 1100 0.92 0.816 0.868
0.1 1 1200 0.918 0.814 0.869
0.1 1 1300 0.917 0.816 0.867
0.1 1 1400 0.918 0.811 0.866
0.1 1 1500 0.916 0.807 0.868
0.1 1 1600 0.915 0.807 0.867
0.1 1 1700 0.916 0.804 0.871
0.1 1 1800 0.914 0.807 0.869
0.1 1 1900 0.913 0.802 0.866
0.1 1 2000 0.913 0.802 0.865
0.1 3 100 0.925 0.856 0.847
0.1 3 200 0.932 0.839 0.871
0.1 3 300 0.933 0.835 0.874
0.1 3 400 0.932 0.83 0.877
0.1 3 500 0.93 0.821 0.88
0.1 3 600 0.928 0.826 0.868
0.1 3 700 0.927 0.809 0.875
0.1 3 800 0.925 0.814 0.877
0.1 3 900 0.924 0.802 0.879
0.1 3 1000 0.923 0.804 0.878
0.1 3 1100 0.923 0.804 0.876
0.1 3 1200 0.923 0.8 0.873
0.1 3 1300 0.921 0.796 0.876
0.1 3 1400 0.922 0.793 0.877
0.1 3 1500 0.921 0.793 0.878
0.1 3 1600 0.921 0.791 0.877
0.1 3 1700 0.922 0.784 0.878
0.1 3 1800 0.92 0.775 0.883
0.1 3 1900 0.921 0.784 0.881
0.1 3 2000 0.918 0.786 0.881
0.1 5 100 0.934 0.86 0.868
0.1 5 200 0.935 0.846 0.87
0.1 5 300 0.933 0.833 0.872
0.1 5 400 0.932 0.828 0.875
0.1 5 500 0.931 0.816 0.875
0.1 5 600 0.93 0.832 0.877
0.1 5 700 0.929 0.818 0.879
0.1 5 800 0.926 0.8 0.882
0.1 5 900 0.927 0.802 0.883
0.1 5 1000 0.926 0.796 0.878
0.1 5 1100 0.926 0.807 0.881
0.1 5 1200 0.925 0.807 0.875
0.1 5 1300 0.925 0.805 0.877
0.1 5 1400 0.924 0.796 0.875
0.1 5 1500 0.924 0.809 0.877
0.1 5 1600 0.924 0.807 0.878
0.1 5 1700 0.923 0.811 0.878
0.1 5 1800 0.923 0.811 0.878
0.1 5 1900 0.921 0.809 0.876
0.1 5 2000 0.922 0.809 0.871
0.1 7 100 0.934 0.84 0.875
0.1 7 200 0.931 0.809 0.875
0.1 7 300 0.93 0.796 0.879
0.1 7 400 0.928 0.793 0.877
0.1 7 500 0.926 0.804 0.873
0.1 7 600 0.924 0.784 0.872
0.1 7 700 0.922 0.782 0.877
0.1 7 800 0.923 0.789 0.873
0.1 7 900 0.924 0.796 0.873
0.1 7 1000 0.924 0.793 0.875
0.1 7 1100 0.924 0.793 0.872
0.1 7 1200 0.923 0.791 0.876
0.1 7 1300 0.925 0.782 0.877
0.1 7 1400 0.923 0.775 0.878
0.1 7 1500 0.923 0.767 0.877
0.1 7 1600 0.923 0.767 0.877
0.1 7 1700 0.922 0.772 0.878
0.1 7 1800 0.922 0.779 0.879
0.1 7 1900 0.922 0.768 0.878
0.1 7 2000 0.921 0.77 0.878
0.1 9 100 0.933 0.828 0.871
0.1 9 200 0.931 0.814 0.889
0.1 9 300 0.929 0.796 0.887
0.1 9 400 0.928 0.793 0.881
0.1 9 500 0.926 0.789 0.884
0.1 9 600 0.927 0.779 0.883
0.1 9 700 0.928 0.791 0.883
0.1 9 800 0.928 0.791 0.884
0.1 9 900 0.926 0.777 0.881
0.1 9 1000 0.925 0.772 0.886
0.1 9 1100 0.925 0.777 0.887
0.1 9 1200 0.925 0.772 0.887
0.1 9 1300 0.925 0.763 0.883
0.1 9 1400 0.924 0.772 0.883
0.1 9 1500 0.922 0.763 0.88
0.1 9 1600 0.922 0.761 0.884
0.1 9 1700 0.922 0.76 0.883
0.1 9 1800 0.922 0.758 0.882
0.1 9 1900 0.923 0.76 0.884
0.1 9 2000 0.923 0.765 0.886
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 1300, interaction.depth =
9 and shrinkage = 0.01.
>
> gbmFit$pred <- merge(gbmFit$pred, gbmFit$bestTune)
> gbmCM <- confusionMatrix(gbmFit, norm = "none")
> gbmCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 480 123
unsuccessful 90 864
Accuracy : 0.8632
95% CI : (0.8451, 0.8799)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7088
Mcnemar's Test P-Value : 0.02834
Sensitivity : 0.8421
Specificity : 0.8754
Pos Pred Value : 0.7960
Neg Pred Value : 0.9057
Prevalence : 0.3661
Detection Rate : 0.3083
Detection Prevalence : 0.3873
Balanced Accuracy : 0.8587
'Positive' Class : successful
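The Kappa statistic above compares the observed agreement with the agreement expected by chance from the row and column margins. The printed value 0.7088 can be reproduced from the gbm confusion-matrix counts:

```python
# Cell counts from the gbm confusion matrix above.
tp, fp = 480, 123   # predicted successful
fn, tn = 90, 864    # predicted unsuccessful
n = tp + fp + fn + tn

observed = (tp + tn) / n    # raw accuracy, 0.8632
# Chance agreement implied by the marginal totals.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 4))      # 0.7088
```

Because chance agreement is high here (about 0.53, driven by the 0.6339 no-information rate), Kappa is noticeably lower than raw accuracy even for a good model.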
>
> gbmRoc <- roc(response = gbmFit$pred$obs,
+ predictor = gbmFit$pred$successful,
+ levels = rev(levels(gbmFit$pred$obs)))
>
> set.seed(476)
> gbmFactorFit <- train(x = training[,factorPredictors],
+ y = training$Class,
+ method = "gbm",
+ tuneGrid = gbmGrid,
+ verbose = FALSE,
+ metric = "ROC",
+ trControl = ctrl)
> gbmFactorFit
Stochastic Gradient Boosting
8190 samples
1488 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
shrinkage interaction.depth n.trees ROC Sens Spec
0.01 1 100 0.881 0.658 0.797
0.01 1 200 0.886 0.872 0.821
0.01 1 300 0.887 0.882 0.824
0.01 1 400 0.888 0.886 0.8
0.01 1 500 0.886 0.886 0.8
0.01 1 600 0.883 0.888 0.799
0.01 1 700 0.883 0.888 0.799
0.01 1 800 0.881 0.888 0.799
0.01 1 900 0.883 0.884 0.8
0.01 1 1000 0.884 0.884 0.8
0.01 1 1100 0.885 0.884 0.801
0.01 1 1200 0.883 0.882 0.802
0.01 1 1300 0.88 0.882 0.8
0.01 1 1400 0.877 0.882 0.801
0.01 1 1500 0.873 0.884 0.8
0.01 1 1600 0.87 0.882 0.8
0.01 1 1700 0.869 0.881 0.802
0.01 1 1800 0.867 0.884 0.804
0.01 1 1900 0.866 0.884 0.803
0.01 1 2000 0.864 0.884 0.803
0.01 3 100 0.907 0.884 0.792
0.01 3 200 0.909 0.886 0.793
0.01 3 300 0.905 0.886 0.795
0.01 3 400 0.902 0.884 0.799
0.01 3 500 0.894 0.884 0.796
0.01 3 600 0.888 0.884 0.797
0.01 3 700 0.881 0.884 0.8
0.01 3 800 0.878 0.886 0.803
0.01 3 900 0.874 0.888 0.804
0.01 3 1000 0.873 0.886 0.802
0.01 3 1100 0.872 0.886 0.805
0.01 3 1200 0.872 0.884 0.806
0.01 3 1300 0.872 0.881 0.807
0.01 3 1400 0.872 0.882 0.806
0.01 3 1500 0.872 0.881 0.807
0.01 3 1600 0.872 0.882 0.809
0.01 3 1700 0.872 0.881 0.81
0.01 3 1800 0.872 0.888 0.81
0.01 3 1900 0.872 0.884 0.807
0.01 3 2000 0.873 0.881 0.81
0.01 5 100 0.909 0.86 0.805
0.01 5 200 0.906 0.875 0.792
0.01 5 300 0.899 0.879 0.799
0.01 5 400 0.894 0.882 0.798
0.01 5 500 0.886 0.882 0.798
0.01 5 600 0.881 0.882 0.801
0.01 5 700 0.878 0.879 0.802
0.01 5 800 0.877 0.879 0.803
0.01 5 900 0.876 0.877 0.803
0.01 5 1000 0.876 0.879 0.806
0.01 5 1100 0.876 0.879 0.806
0.01 5 1200 0.876 0.881 0.809
0.01 5 1300 0.876 0.879 0.806
0.01 5 1400 0.876 0.882 0.806
0.01 5 1500 0.876 0.884 0.809
0.01 5 1600 0.876 0.881 0.806
0.01 5 1700 0.876 0.882 0.806
0.01 5 1800 0.876 0.882 0.809
0.01 5 1900 0.876 0.879 0.805
0.01 5 2000 0.876 0.882 0.804
0.01 7 100 0.917 0.882 0.78
0.01 7 200 0.904 0.879 0.797
0.01 7 300 0.896 0.881 0.797
0.01 7 400 0.886 0.875 0.804
0.01 7 500 0.88 0.877 0.804
0.01 7 600 0.878 0.875 0.803
0.01 7 700 0.876 0.877 0.806
0.01 7 800 0.876 0.877 0.807
0.01 7 900 0.876 0.879 0.813
0.01 7 1000 0.876 0.879 0.811
0.01 7 1100 0.875 0.875 0.81
0.01 7 1200 0.875 0.875 0.811
0.01 7 1300 0.875 0.874 0.811
0.01 7 1400 0.875 0.875 0.811
0.01 7 1500 0.875 0.875 0.811
0.01 7 1600 0.875 0.874 0.811
0.01 7 1700 0.875 0.875 0.807
0.01 7 1800 0.875 0.875 0.806
0.01 7 1900 0.875 0.875 0.807
0.01 7 2000 0.875 0.877 0.811
0.01 9 100 0.913 0.882 0.789
0.01 9 200 0.904 0.881 0.789
0.01 9 300 0.893 0.879 0.795
0.01 9 400 0.883 0.881 0.804
0.01 9 500 0.879 0.881 0.806
0.01 9 600 0.877 0.879 0.806
0.01 9 700 0.876 0.881 0.811
0.01 9 800 0.876 0.881 0.811
0.01 9 900 0.875 0.881 0.811
0.01 9 1000 0.875 0.875 0.814
0.01 9 1100 0.875 0.874 0.81
0.01 9 1200 0.875 0.874 0.81
0.01 9 1300 0.875 0.874 0.81
0.01 9 1400 0.875 0.872 0.81
0.01 9 1500 0.875 0.874 0.81
0.01 9 1600 0.874 0.874 0.81
0.01 9 1700 0.874 0.875 0.811
0.01 9 1800 0.874 0.875 0.81
0.01 9 1900 0.874 0.879 0.809
0.01 9 2000 0.874 0.879 0.809
0.1 1 100 0.882 0.891 0.8
0.1 1 200 0.865 0.888 0.801
0.1 1 300 0.857 0.891 0.798
0.1 1 400 0.858 0.882 0.802
0.1 1 500 0.858 0.884 0.801
0.1 1 600 0.859 0.888 0.801
0.1 1 700 0.858 0.884 0.804
0.1 1 800 0.857 0.886 0.799
0.1 1 900 0.857 0.884 0.797
0.1 1 1000 0.856 0.886 0.8
0.1 1 1100 0.857 0.886 0.801
0.1 1 1200 0.856 0.889 0.801
0.1 1 1300 0.856 0.891 0.804
0.1 1 1400 0.855 0.886 0.801
0.1 1 1500 0.855 0.882 0.804
0.1 1 1600 0.855 0.884 0.807
0.1 1 1700 0.856 0.888 0.801
0.1 1 1800 0.855 0.882 0.811
0.1 1 1900 0.855 0.881 0.807
0.1 1 2000 0.855 0.888 0.811
0.1 3 100 0.875 0.886 0.799
0.1 3 200 0.873 0.882 0.813
0.1 3 300 0.872 0.891 0.81
0.1 3 400 0.872 0.889 0.809
0.1 3 500 0.871 0.888 0.812
0.1 3 600 0.87 0.893 0.812
0.1 3 700 0.87 0.888 0.811
0.1 3 800 0.87 0.889 0.81
0.1 3 900 0.869 0.881 0.813
0.1 3 1000 0.869 0.879 0.815
0.1 3 1100 0.869 0.879 0.814
0.1 3 1200 0.868 0.884 0.811
0.1 3 1300 0.868 0.872 0.812
0.1 3 1400 0.867 0.877 0.807
0.1 3 1500 0.865 0.874 0.811
0.1 3 1600 0.865 0.881 0.81
0.1 3 1700 0.864 0.877 0.812
0.1 3 1800 0.865 0.879 0.812
0.1 3 1900 0.865 0.879 0.815
0.1 3 2000 0.864 0.87 0.817
0.1 5 100 0.873 0.879 0.807
0.1 5 200 0.872 0.891 0.8
0.1 5 300 0.871 0.875 0.814
0.1 5 400 0.87 0.882 0.806
0.1 5 500 0.868 0.879 0.806
0.1 5 600 0.869 0.87 0.807
0.1 5 700 0.868 0.875 0.809
0.1 5 800 0.866 0.881 0.811
0.1 5 900 0.865 0.879 0.805
0.1 5 1000 0.865 0.879 0.806
0.1 5 1100 0.864 0.868 0.81
0.1 5 1200 0.863 0.877 0.807
0.1 5 1300 0.863 0.879 0.806
0.1 5 1400 0.863 0.875 0.805
0.1 5 1500 0.862 0.879 0.802
0.1 5 1600 0.862 0.872 0.806
0.1 5 1700 0.862 0.879 0.809
0.1 5 1800 0.862 0.877 0.807
0.1 5 1900 0.862 0.875 0.809
0.1 5 2000 0.861 0.879 0.803
0.1 7 100 0.876 0.893 0.809
0.1 7 200 0.873 0.879 0.804
0.1 7 300 0.87 0.882 0.799
0.1 7 400 0.868 0.882 0.798
0.1 7 500 0.864 0.879 0.8
0.1 7 600 0.863 0.879 0.804
0.1 7 700 0.863 0.87 0.802
0.1 7 800 0.863 0.872 0.802
0.1 7 900 0.863 0.874 0.801
0.1 7 1000 0.862 0.868 0.8
0.1 7 1100 0.861 0.863 0.794
0.1 7 1200 0.862 0.861 0.793
0.1 7 1300 0.861 0.863 0.796
0.1 7 1400 0.86 0.861 0.797
0.1 7 1500 0.86 0.867 0.796
0.1 7 1600 0.859 0.861 0.799
0.1 7 1700 0.859 0.87 0.797
0.1 7 1800 0.86 0.863 0.801
0.1 7 1900 0.86 0.868 0.799
0.1 7 2000 0.859 0.858 0.796
0.1 9 100 0.872 0.874 0.811
0.1 9 200 0.868 0.872 0.801
0.1 9 300 0.866 0.872 0.806
0.1 9 400 0.865 0.868 0.8
0.1 9 500 0.863 0.872 0.801
0.1 9 600 0.861 0.879 0.803
0.1 9 700 0.861 0.874 0.8
0.1 9 800 0.861 0.87 0.801
0.1 9 900 0.861 0.874 0.796
0.1 9 1000 0.86 0.868 0.795
0.1 9 1100 0.86 0.874 0.798
0.1 9 1200 0.859 0.868 0.797
0.1 9 1300 0.859 0.868 0.796
0.1 9 1400 0.859 0.87 0.797
0.1 9 1500 0.86 0.874 0.796
0.1 9 1600 0.859 0.868 0.796
0.1 9 1700 0.858 0.874 0.796
0.1 9 1800 0.86 0.874 0.799
0.1 9 1900 0.859 0.877 0.796
0.1 9 2000 0.859 0.879 0.795
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 100, interaction.depth =
7 and shrinkage = 0.01.
>
> gbmFactorFit$pred <- merge(gbmFactorFit$pred, gbmFactorFit$bestTune)
> gbmFactorCM <- confusionMatrix(gbmFactorFit, norm = "none")
> gbmFactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 503 217
unsuccessful 67 770
Accuracy : 0.8176
95% CI : (0.7975, 0.8365)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6277
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8825
Specificity : 0.7801
Pos Pred Value : 0.6986
Neg Pred Value : 0.9200
Prevalence : 0.3661
Detection Rate : 0.3231
Detection Prevalence : 0.4624
Balanced Accuracy : 0.8313
'Positive' Class : successful
>
> gbmFactorRoc <- roc(response = gbmFactorFit$pred$obs,
+ predictor = gbmFactorFit$pred$successful,
+ levels = rev(levels(gbmFactorFit$pred$obs)))
>
> gbmROCRange <- extendrange(cbind(gbmFactorFit$results$ROC,gbmFit$results$ROC))
>
> plot(gbmFactorFit, ylim = gbmROCRange,
+ auto.key = list(columns = 4, lines = TRUE))
>
>
> plot(gbmFit, ylim = gbmROCRange,
+ auto.key = list(columns = 4, lines = TRUE))
>
>
> plot(treebagRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = treebagFit$pred$obs, predictor = treebagFit$pred$successful, levels = rev(levels(treebagFit$pred$obs)))
Data: treebagFit$pred$successful in 987 controls (treebagFit$pred$obs unsuccessful) < 570 cases (treebagFit$pred$obs successful).
Area under the curve: 0.9205
> plot(rpartRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful, levels = rev(levels(rpartFit$pred$obs)))
Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(j48FactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful, levels = rev(levels(j48FactorFit$pred$obs)))
Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(rfFactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = rfFactorFit$pred$obs, predictor = rfFactorFit$pred$successful, levels = rev(levels(rfFactorFit$pred$obs)))
Data: rfFactorFit$pred$successful in 8883 controls (rfFactorFit$pred$obs unsuccessful) < 5130 cases (rfFactorFit$pred$obs successful).
Area under the curve: 0.9049
> plot(gbmRoc, type = "s", print.thres = c(.5), print.thres.pch = 3,
+ print.thres.pattern = "", print.thres.cex = 1.2,
+ add = TRUE, col = "red", print.thres.col = "red", legacy.axes = TRUE)
Call:
roc.default(response = gbmFit$pred$obs, predictor = gbmFit$pred$successful, levels = rev(levels(gbmFit$pred$obs)))
Data: gbmFit$pred$successful in 987 controls (gbmFit$pred$obs unsuccessful) < 570 cases (gbmFit$pred$obs successful).
Area under the curve: 0.9361
> plot(gbmFactorRoc, type = "s", print.thres = c(.5), print.thres.pch = 16,
+ legacy.axes = TRUE, print.thres.pattern = "", print.thres.cex = 1.2,
+ add = TRUE)
Call:
roc.default(response = gbmFactorFit$pred$obs, predictor = gbmFactorFit$pred$successful, levels = rev(levels(gbmFactorFit$pred$obs)))
Data: gbmFactorFit$pred$successful in 987 controls (gbmFactorFit$pred$obs unsuccessful) < 570 cases (gbmFactorFit$pred$obs successful).
Area under the curve: 0.9168
> legend(.75, .2,
+ c("Grouped Categories", "Independent Categories"),
+ lwd = c(1, 1),
+ col = c("black", "red"),
+ pch = c(16, 3))
>
> ################################################################################
> ### Section 14.6 C5.0
>
> c50Grid <- expand.grid(trials = c(1:9, (1:10)*10),
+ model = c("tree", "rules"),
+ winnow = c(TRUE, FALSE))
> set.seed(476)
> c50FactorFit <- train(training[,factorPredictors], training$Class,
+ method = "C5.0",
+ tuneGrid = c50Grid,
+ verbose = FALSE,
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: C50
> c50FactorFit
C5.0
8190 samples
1488 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
model winnow trials ROC Sens Spec
rules FALSE 1 0.877 0.886 0.796
rules FALSE 2 0.886 0.621 0.919
rules FALSE 3 0.9 0.782 0.844
rules FALSE 4 0.905 0.816 0.858
rules FALSE 5 0.907 0.802 0.846
rules FALSE 6 0.917 0.832 0.841
rules FALSE 7 0.922 0.796 0.873
rules FALSE 8 0.924 0.847 0.866
rules FALSE 9 0.923 0.832 0.867
rules FALSE 10 0.92 0.818 0.87
rules FALSE 20 0.934 0.823 0.888
rules FALSE 30 0.937 0.844 0.875
rules FALSE 40 0.938 0.844 0.88
rules FALSE 50 0.939 0.835 0.88
rules FALSE 60 0.94 0.842 0.882
rules FALSE 70 0.939 0.839 0.884
rules FALSE 80 0.941 0.847 0.886
rules FALSE 90 0.941 0.842 0.884
rules FALSE 100 0.942 0.849 0.888
rules TRUE 1 0.859 0.886 0.81
rules TRUE 2 0.892 0.784 0.851
rules TRUE 3 0.895 0.796 0.85
rules TRUE 4 0.914 0.811 0.862
rules TRUE 5 0.919 0.828 0.865
rules TRUE 6 0.923 0.795 0.875
rules TRUE 7 0.927 0.856 0.854
rules TRUE 8 0.93 0.818 0.876
rules TRUE 9 0.931 0.846 0.867
rules TRUE 10 0.932 0.854 0.869
rules TRUE 20 0.932 0.854 0.869
rules TRUE 30 0.933 0.849 0.869
rules TRUE 40 0.935 0.856 0.871
rules TRUE 50 0.936 0.856 0.87
rules TRUE 60 0.936 0.856 0.868
rules TRUE 70 0.936 0.868 0.867
rules TRUE 80 0.937 0.858 0.873
rules TRUE 90 0.937 0.867 0.869
rules TRUE 100 0.937 0.87 0.874
tree FALSE 1 0.906 0.874 0.832
tree FALSE 2 0.903 0.886 0.838
tree FALSE 3 0.908 0.809 0.853
tree FALSE 4 0.908 0.84 0.859
tree FALSE 5 0.909 0.818 0.835
tree FALSE 6 0.908 0.835 0.844
tree FALSE 7 0.909 0.825 0.835
tree FALSE 8 0.913 0.842 0.844
tree FALSE 9 0.921 0.847 0.839
tree FALSE 10 0.921 0.847 0.838
tree FALSE 20 0.929 0.853 0.855
tree FALSE 30 0.933 0.858 0.868
tree FALSE 40 0.934 0.853 0.875
tree FALSE 50 0.934 0.847 0.872
tree FALSE 60 0.935 0.86 0.872
tree FALSE 70 0.935 0.854 0.872
tree FALSE 80 0.935 0.856 0.867
tree FALSE 90 0.935 0.853 0.866
tree FALSE 100 0.936 0.847 0.867
tree TRUE 1 0.904 0.877 0.826
tree TRUE 2 0.895 0.874 0.85
tree TRUE 3 0.91 0.856 0.835
tree TRUE 4 0.911 0.826 0.83
tree TRUE 5 0.912 0.816 0.848
tree TRUE 6 0.918 0.856 0.852
tree TRUE 7 0.919 0.833 0.856
tree TRUE 8 0.92 0.837 0.854
tree TRUE 9 0.921 0.83 0.854
tree TRUE 10 0.923 0.833 0.846
tree TRUE 20 0.929 0.856 0.863
tree TRUE 30 0.932 0.867 0.86
tree TRUE 40 0.933 0.865 0.867
tree TRUE 50 0.934 0.868 0.873
tree TRUE 60 0.935 0.865 0.869
tree TRUE 70 0.934 0.877 0.854
tree TRUE 80 0.935 0.865 0.86
tree TRUE 90 0.934 0.861 0.869
tree TRUE 100 0.935 0.872 0.866
ROC was used to select the optimal model using the largest value.
The final values used for the model were trials = 100, model = rules and
winnow = FALSE.
>
> c50FactorFit$pred <- merge(c50FactorFit$pred, c50FactorFit$bestTune)
> c50FactorCM <- confusionMatrix(c50FactorFit, norm = "none")
> c50FactorCM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 484 111
unsuccessful 86 876
Accuracy : 0.8735
95% CI : (0.8559, 0.8896)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.7299
Mcnemar's Test P-Value : 0.08728
Sensitivity : 0.8491
Specificity : 0.8875
Pos Pred Value : 0.8134
Neg Pred Value : 0.9106
Prevalence : 0.3661
Detection Rate : 0.3109
Detection Prevalence : 0.3821
Balanced Accuracy : 0.8683
'Positive' Class : successful
>
> c50FactorRoc <- roc(response = c50FactorFit$pred$obs,
+ predictor = c50FactorFit$pred$successful,
+ levels = rev(levels(c50FactorFit$pred$obs)))
>
> set.seed(476)
> c50Fit <- train(training[,fullSet], training$Class,
+ method = "C5.0",
+ tuneGrid = c50Grid,
+ metric = "ROC",
+ verbose = FALSE,
+ trControl = ctrl)
> c50Fit
C5.0
8190 samples
1070 predictors
2 classes: 'successful', 'unsuccessful'
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (1 reps, 0.75%)
Summary of sample sizes: 6633
Resampling results across tuning parameters:
model winnow trials ROC Sens Spec
rules FALSE 1 0.893 0.768 0.87
rules FALSE 2 0.877 0.872 0.831
rules FALSE 3 0.896 0.747 0.874
rules FALSE 4 0.901 0.823 0.858
rules FALSE 5 0.901 0.753 0.883
rules FALSE 6 0.914 0.851 0.855
rules FALSE 7 0.919 0.805 0.87
rules FALSE 8 0.919 0.839 0.859
rules FALSE 9 0.924 0.833 0.872
rules FALSE 10 0.921 0.839 0.867
rules FALSE 20 0.928 0.846 0.866
rules FALSE 30 0.932 0.842 0.868
rules FALSE 40 0.934 0.84 0.872
rules FALSE 50 0.931 0.826 0.872
rules FALSE 60 0.933 0.842 0.872
rules FALSE 70 0.934 0.839 0.869
rules FALSE 80 0.935 0.84 0.873
rules FALSE 90 0.935 0.832 0.872
rules FALSE 100 0.935 0.844 0.871
rules TRUE 1 0.85 0.847 0.847
rules TRUE 2 0.882 0.868 0.829
rules TRUE 3 0.899 0.775 0.868
rules TRUE 4 0.91 0.854 0.834
rules TRUE 5 0.918 0.821 0.854
rules TRUE 6 0.915 0.839 0.839
rules TRUE 7 0.917 0.786 0.867
rules TRUE 8 0.921 0.842 0.853
rules TRUE 9 0.917 0.814 0.865
rules TRUE 10 0.919 0.825 0.862
rules TRUE 20 0.927 0.84 0.858
rules TRUE 30 0.923 0.809 0.869
rules TRUE 40 0.927 0.84 0.866
rules TRUE 50 0.927 0.844 0.862
rules TRUE 60 0.928 0.839 0.867
rules TRUE 70 0.928 0.837 0.866
rules TRUE 80 0.929 0.833 0.864
rules TRUE 90 0.93 0.823 0.873
rules TRUE 100 0.931 0.825 0.872
tree FALSE 1 0.9 0.753 0.878
tree FALSE 2 0.874 0.805 0.858
tree FALSE 3 0.908 0.758 0.872
tree FALSE 4 0.914 0.832 0.852
tree FALSE 5 0.921 0.814 0.857
tree FALSE 6 0.916 0.826 0.851
tree FALSE 7 0.921 0.805 0.869
tree FALSE 8 0.923 0.835 0.852
tree FALSE 9 0.924 0.809 0.866
tree FALSE 10 0.924 0.825 0.864
tree FALSE 20 0.932 0.823 0.873
tree FALSE 30 0.932 0.819 0.88
tree FALSE 40 0.932 0.828 0.881
tree FALSE 50 0.932 0.83 0.878
tree FALSE 60 0.933 0.842 0.874
tree FALSE 70 0.934 0.842 0.87
tree FALSE 80 0.934 0.835 0.868
tree FALSE 90 0.934 0.837 0.872
tree FALSE 100 0.935 0.842 0.875
tree TRUE 1 0.905 0.837 0.854
tree TRUE 2 0.877 0.782 0.851
tree TRUE 3 0.896 0.753 0.864
tree TRUE 4 0.902 0.774 0.862
tree TRUE 5 0.908 0.791 0.852
tree TRUE 6 0.908 0.805 0.856
tree TRUE 7 0.914 0.798 0.868
tree TRUE 8 0.915 0.795 0.865
tree TRUE 9 0.916 0.782 0.867
tree TRUE 10 0.919 0.809 0.864
tree TRUE 20 0.919 0.807 0.874
tree TRUE 30 0.926 0.804 0.873
tree TRUE 40 0.927 0.809 0.877
tree TRUE 50 0.928 0.814 0.873
tree TRUE 60 0.926 0.809 0.872
tree TRUE 70 0.928 0.812 0.871
tree TRUE 80 0.929 0.816 0.869
tree TRUE 90 0.929 0.816 0.872
tree TRUE 100 0.929 0.818 0.869
ROC was used to select the optimal model using the largest value.
The final values used for the model were trials = 90, model = rules and
winnow = FALSE.
>
> c50Fit$pred <- merge(c50Fit$pred, c50Fit$bestTune)
> c50CM <- confusionMatrix(c50Fit, norm = "none")
> c50CM
Repeated Train/Test Splits Estimated (1 reps, 0.75%) Confusion Matrix
(entries are un-normalized counts)
Confusion Matrix and Statistics
Reference
Prediction successful unsuccessful
successful 474 126
unsuccessful 96 861
Accuracy : 0.8574
95% CI : (0.8391, 0.8744)
No Information Rate : 0.6339
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6962
Mcnemar's Test P-Value : 0.05161
Sensitivity : 0.8316
Specificity : 0.8723
Pos Pred Value : 0.7900
Neg Pred Value : 0.8997
Prevalence : 0.3661
Detection Rate : 0.3044
Detection Prevalence : 0.3854
Balanced Accuracy : 0.8520
'Positive' Class : successful
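The `c50Fit$pred <- merge(c50Fit$pred, c50Fit$bestTune)` step above works because `merge()` joins on the shared tuning-parameter columns, so only the hold-out predictions made at the winning parameter combination survive into the confusion matrix. A minimal base-R sketch with made-up data (not the book's predictions):

```r
# Hypothetical toy data: merge() joins on the common columns
# (trials, model), so rows predicted under other tuning values drop out.
pred <- data.frame(obs    = c("succ", "unsucc", "succ", "unsucc"),
                   trials = c(1, 1, 90, 90),
                   model  = "rules")
bestTune <- data.frame(trials = 90, model = "rules")
merge(pred, bestTune)  # only the trials = 90 rows remain
```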
>
> c50Roc <- roc(response = c50Fit$pred$obs,
+ predictor = c50Fit$pred$successful,
+ levels = rev(levels(c50Fit$pred$obs)))
>
> update(plot(c50FactorFit), ylab = "ROC AUC (2008 Hold-Out Data)")
>
>
> plot(treebagRoc, type = "s", col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = treebagFit$pred$obs, predictor = treebagFit$pred$successful, levels = rev(levels(treebagFit$pred$obs)))
Data: treebagFit$pred$successful in 987 controls (treebagFit$pred$obs unsuccessful) < 570 cases (treebagFit$pred$obs successful).
Area under the curve: 0.9205
> plot(rpartRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = rpartFit$pred$obs, predictor = rpartFit$pred$successful, levels = rev(levels(rpartFit$pred$obs)))
Data: rpartFit$pred$successful in 29610 controls (rpartFit$pred$obs unsuccessful) < 17100 cases (rpartFit$pred$obs successful).
Area under the curve: 0.8915
> plot(j48FactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = j48FactorFit$pred$obs, predictor = j48FactorFit$pred$successful, levels = rev(levels(j48FactorFit$pred$obs)))
Data: j48FactorFit$pred$successful in 987 controls (j48FactorFit$pred$obs unsuccessful) < 570 cases (j48FactorFit$pred$obs successful).
Area under the curve: 0.8353
> plot(rfFactorRoc, type = "s", add = TRUE, col = rgb(.2, .2, .2, .2), legacy.axes = TRUE)
Call:
roc.default(response = rfFactorFit$pred$obs, predictor = rfFactorFit$pred$successful, levels = rev(levels(rfFactorFit$pred$obs)))
Data: rfFactorFit$pred$successful in 8883 controls (rfFactorFit$pred$obs unsuccessful) < 5130 cases (rfFactorFit$pred$obs successful).
Area under the curve: 0.9049
> plot(gbmRoc, type = "s", col = rgb(.2, .2, .2, .2), add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = gbmFit$pred$obs, predictor = gbmFit$pred$successful, levels = rev(levels(gbmFit$pred$obs)))
Data: gbmFit$pred$successful in 987 controls (gbmFit$pred$obs unsuccessful) < 570 cases (gbmFit$pred$obs successful).
Area under the curve: 0.9361
> plot(c50Roc, type = "s", print.thres = c(.5), print.thres.pch = 3,
+ print.thres.pattern = "", print.thres.cex = 1.2,
+ add = TRUE, col = "red", print.thres.col = "red", legacy.axes = TRUE)
Call:
roc.default(response = c50Fit$pred$obs, predictor = c50Fit$pred$successful, levels = rev(levels(c50Fit$pred$obs)))
Data: c50Fit$pred$successful in 987 controls (c50Fit$pred$obs unsuccessful) < 570 cases (c50Fit$pred$obs successful).
Area under the curve: 0.9352
> plot(c50FactorRoc, type = "s", print.thres = c(.5), print.thres.pch = 16,
+ print.thres.pattern = "", print.thres.cex = 1.2,
+ add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = c50FactorFit$pred$obs, predictor = c50FactorFit$pred$successful, levels = rev(levels(c50FactorFit$pred$obs)))
Data: c50FactorFit$pred$successful in 987 controls (c50FactorFit$pred$obs unsuccessful) < 570 cases (c50FactorFit$pred$obs successful).
Area under the curve: 0.942
> legend(.75, .2,
+ c("Grouped Categories", "Independent Categories"),
+ lwd = c(1, 1),
+ col = c("black", "red"),
+ pch = c(16, 3))
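Each `Area under the curve` value above is computed by `pROC::roc()`. The statistic itself has a simple interpretation: the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one (the Mann-Whitney statistic, with ties counting one half). A self-contained base-R sketch on toy scores, not the book's predictions:

```r
# ROC AUC as the Mann-Whitney statistic: fraction of positive/negative
# pairs where the positive case outranks the negative one.
auc_by_hand <- function(score, is_pos) {
  pos <- score[is_pos]
  neg <- score[!is_pos]
  mean(outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n)))
}
auc_by_hand(c(.9, .8, .4, .3, .2), c(TRUE, TRUE, FALSE, TRUE, FALSE))
# 5 of the 6 pairs are correctly ordered, so the AUC is 5/6
```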
>
> ################################################################################
> ### Section 14.7 Comparing Two Encodings of Categorical Predictors
>
> ## Pull the hold-out results from each model and merge
>
> rp1 <- caret:::getTrainPerf(rpartFit)
> names(rp1) <- gsub("Train", "Independent", names(rp1))
> rp2 <- caret:::getTrainPerf(rpartFactorFit)
> rp2$Label <- "CART"
> names(rp2) <- gsub("Train", "Grouped", names(rp2))
> rp <- cbind(rp1, rp2)
>
> j481 <- caret:::getTrainPerf(j48Fit)
> names(j481) <- gsub("Train", "Independent", names(j481))
> j482 <- caret:::getTrainPerf(j48FactorFit)
> j482$Label <- "J48"
> names(j482) <- gsub("Train", "Grouped", names(j482))
> j48 <- cbind(j481, j482)
>
> part1 <- caret:::getTrainPerf(partFit)
> names(part1) <- gsub("Train", "Independent", names(part1))
> part2 <- caret:::getTrainPerf(partFactorFit)
> part2$Label <- "PART"
> names(part2) <- gsub("Train", "Grouped", names(part2))
> part <- cbind(part1, part2)
>
> tb1 <- caret:::getTrainPerf(treebagFit)
> names(tb1) <- gsub("Train", "Independent", names(tb1))
> tb2 <- caret:::getTrainPerf(treebagFactorFit)
> tb2$Label <- "Bagged Tree"
> names(tb2) <- gsub("Train", "Grouped", names(tb2))
> tb <- cbind(tb1, tb2)
>
> rf1 <- caret:::getTrainPerf(rfFit)
> names(rf1) <- gsub("Train", "Independent", names(rf1))
> rf2 <- caret:::getTrainPerf(rfFactorFit)
> rf2$Label <- "Random Forest"
> names(rf2) <- gsub("Train", "Grouped", names(rf2))
> rf <- cbind(rf1, rf2)
>
> gbm1 <- caret:::getTrainPerf(gbmFit)
> names(gbm1) <- gsub("Train", "Independent", names(gbm1))
> gbm2 <- caret:::getTrainPerf(gbmFactorFit)
> gbm2$Label <- "Boosted Tree"
> names(gbm2) <- gsub("Train", "Grouped", names(gbm2))
> bst <- cbind(gbm1, gbm2)
>
>
> c501 <- caret:::getTrainPerf(c50Fit)
> names(c501) <- gsub("Train", "Independent", names(c501))
> c502 <- caret:::getTrainPerf(c50FactorFit)
> c502$Label <- "C5.0"
> names(c502) <- gsub("Train", "Grouped", names(c502))
> c5 <- cbind(c501, c502)
>
>
> trainPerf <- rbind(rp, j48, part, tb, rf, bst, c5)
>
> library(lattice)
> library(reshape2)
> trainPerf <- melt(trainPerf)
Using method, method, Label as id variables
> trainPerf$metric <- "ROC"
> trainPerf$metric[grepl("Sens", trainPerf$variable)] <- "Sensitivity"
> trainPerf$metric[grepl("Spec", trainPerf$variable)] <- "Specificity"
> trainPerf$model <- "Grouped"
> trainPerf$model[grepl("Independent", trainPerf$variable)] <- "Independent"
>

> trainPerf$Label <- factor(trainPerf$Label,
+ levels = rev(c("CART", "Cond. Trees", "J48", "Ripper",
+ "PART", "Bagged Tree", "Random Forest",
+ "Boosted Tree", "C5.0")))
>
> dotplot(Label ~ value|metric,
+ data = trainPerf,
+ groups = model,
+ horizontal = TRUE,
+ auto.key = list(columns = 2),
+ between = list(x = 1),
+ xlab = "")
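`melt()` from reshape2 converts the wide performance table into long format so `dotplot()` can group by encoding and facet by metric. The same wide-to-long move can be done in base R with `reshape()`; a sketch on a hypothetical two-model table (column names invented for illustration):

```r
# Base-R equivalent of the melt() step: one row per (model, encoding) pair.
perf <- data.frame(Label          = c("CART", "C5.0"),
                   IndependentROC = c(.89, .94),
                   GroupedROC     = c(.90, .94))
long <- reshape(perf, direction = "long",
                varying = c("IndependentROC", "GroupedROC"),
                v.names = "value", timevar = "model",
                times = c("Independent", "Grouped"))
long[, c("Label", "model", "value")]  # four rows, ready for plotting
```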
>
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] C
attached base packages:
[1] parallel splines grid stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] reshape2_1.2.2 C50_0.1.0-15 gbm_2.1 randomForest_4.6-7
[5] ipred_0.9-1 prodlim_1.3.7 nnet_7.3-6 survival_2.37-4
[9] MASS_7.3-26 RWeka_0.4-17 e1071_1.6-1 class_7.3-7
[13] partykit_0.1-5 pROC_1.5.4 plyr_1.8 rpart_4.1-1
[17] caret_6.0-22 ggplot2_0.9.3.1 lattice_0.20-15
loaded via a namespace (and not attached):
[1] KernSmooth_2.23-10 RColorBrewer_1.0-5 RWekajars_3.7.9-1 car_2.0-17
[5] codetools_0.2-8 colorspace_1.2-2 compiler_3.0.1 dichromat_2.0-0
[9] digest_0.6.3 foreach_1.4.0 gtable_0.1.2 iterators_1.0.6
[13] labeling_0.1 munsell_0.4 proto_0.3-10 rJava_0.9-4
[17] scales_0.2.3 stringr_0.6.2
>
> q("no")
> proc.time()
user system elapsed
208496.296 776.829 209791.456
%%R -w 600 -h 600
## runChapterScript(14)
## user system elapsed
## 208496.296 776.829 209791.456
NULL
%%R
showChapterScript(16)
NULL
%%R
showChapterOutput(16)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 16: Remedies for Severe Class Imbalance
> ###
> ### Required packages: AppliedPredictiveModeling, caret, C50, earth, DMwR,
> ### DWD, kernlab, mda, pROC, randomForest, rpart
> ###
> ### Data used: The insurance data from the DWD package.
> ###
> ### Notes:
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing section. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be
> ### syntax differences that occur over time as packages evolve. These files
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
>
> ################################################################################
> ### Section 16.1 Case Study: Predicting Caravan Policy Ownership
>
> library(DWD)
Loading required package: Matrix
Loading required package: lattice
> data(ticdata)
>
> ### Some of the predictor names and levels have characters that would result in
> ### illegal variable names. We convert them to more generic names and treat the
> ### ordered factors as nominal (i.e. unordered) factors.
>
> isOrdered <- unlist(lapply(ticdata, function(x) any(class(x) == "ordered")))
>
> recodeLevels <- function(x)
+ {
+ x <- gsub("f ", "", as.character(x))
+ x <- gsub(" - ", "_to_", x)
+ x <- gsub("-", "_to_", x)
+ x <- gsub("%", "", x)
+ x <- gsub("?", "Unk", x, fixed = TRUE)
+ x <- gsub("[,'\\(\\)]", "", x)
+ x <- gsub(" ", "_", x)
+ factor(paste("_", x, sep = ""))
+ }
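`recodeLevels()` above is just a chain of `gsub()` substitutions that turns raw level strings into syntactically valid R names. Tracing one hypothetical level through the same substitutions shows the effect:

```r
# One made-up level traced through the same cleaning steps as above.
x <- "f 20 - 30%"
x <- gsub("f ", "", x)        # drop the "f " marker
x <- gsub(" - ", "_to_", x)   # ranges become "_to_"
x <- gsub("%", "", x)         # strip percent signs
x <- paste0("_", gsub(" ", "_", x))  # spaces to underscores, add prefix
x  # "_20_to_30"
```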
>
> convertCols <- c("STYPE", "MGEMLEEF", "MOSHOOFD",
+ names(isOrdered)[isOrdered])
>
> for(i in convertCols) ticdata[,i] <- factor(gsub(" ", "0",format(as.numeric(ticdata[,i]))))
>
> ticdata$CARAVAN <- factor(as.character(ticdata$CARAVAN),
+ levels = rev(levels(ticdata$CARAVAN)))
>
> ### Split the data into three sets: training, test and evaluation.
> library(caret)
Loading required package: ggplot2
>
> set.seed(156)
>
> split1 <- createDataPartition(ticdata$CARAVAN, p = .7)[[1]]
>
> other <- ticdata[-split1,]
> training <- ticdata[ split1,]
>
> set.seed(934)
>
> split2 <- createDataPartition(other$CARAVAN, p = 1/3)[[1]]
>
> evaluation <- other[ split2,]
> testing <- other[-split2,]
>
> predictors <- names(training)[names(training) != "CARAVAN"]
>
> testResults <- data.frame(CARAVAN = testing$CARAVAN)
> evalResults <- data.frame(CARAVAN = evaluation$CARAVAN)
>
> trainingInd <- data.frame(model.matrix(CARAVAN ~ ., data = training))[,-1]
> evaluationInd <- data.frame(model.matrix(CARAVAN ~ ., data = evaluation))[,-1]
> testingInd <- data.frame(model.matrix(CARAVAN ~ ., data = testing))[,-1]
>
> trainingInd$CARAVAN <- training$CARAVAN
> evaluationInd$CARAVAN <- evaluation$CARAVAN
> testingInd$CARAVAN <- testing$CARAVAN
>
> isNZV <- nearZeroVar(trainingInd)
> noNZVSet <- names(trainingInd)[-isNZV]
>
> testResults <- data.frame(CARAVAN = testing$CARAVAN)
> evalResults <- data.frame(CARAVAN = evaluation$CARAVAN)
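`createDataPartition()` performs a stratified split: it samples within each class, so the training and hold-out sets keep the original class frequencies. That matters here because `CARAVAN` is heavily imbalanced. A base-R sketch of the idea on a toy class vector (not the `ticdata` classes):

```r
# Stratified 70% split done by hand: sample within each class separately.
set.seed(156)
y <- factor(rep(c("insurance", "noinsurance"), c(6, 94)))
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, floor(0.7 * length(i)))))
table(y[idx])  # about 70% of each class, so prevalence is preserved
```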
>
> ################################################################################
> ### Section 16.2 The Effect of Class Imbalance
>
> ### These functions are used to measure performance
>
> fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
> fourStats <- function (data, lev = levels(data$obs), model = NULL)
+ {
+
+ accKapp <- postResample(data[, "pred"], data[, "obs"])
+ out <- c(accKapp,
+ sensitivity(data[, "pred"], data[, "obs"], lev[1]),
+ specificity(data[, "pred"], data[, "obs"], lev[2]))
+ names(out)[3:4] <- c("Sens", "Spec")
+ out
+ }
>
> ctrl <- trainControl(method = "cv",
+ classProbs = TRUE,
+ summaryFunction = fiveStats)
>
> ctrlNoProb <- ctrl
> ctrlNoProb$summaryFunction <- fourStats
> ctrlNoProb$classProbs <- FALSE
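`fourStats()` above leans on `postResample()` for Accuracy and Kappa. Both are easy to compute directly: Kappa rescales accuracy by the agreement expected from the marginal class frequencies alone, which is why it stays near zero for the accurate-but-useless models that follow. A base-R sketch on toy predictions:

```r
# Accuracy and Cohen's kappa from a confusion table, computed by hand.
acc_kappa <- function(pred, obs) {
  tab <- table(pred, obs)
  acc <- sum(diag(tab)) / sum(tab)
  # chance agreement from the row/column marginals
  expected <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
  c(Accuracy = acc, Kappa = (acc - expected) / (1 - expected))
}
acc_kappa(factor(c("a", "a", "b", "b")),
          factor(c("a", "b", "b", "b")))  # Accuracy 0.75, Kappa 0.50
```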
>
>
> set.seed(1410)
> rfFit <- train(CARAVAN ~ ., data = trainingInd,
+ method = "rf",
+ trControl = ctrl,
+ ntree = 1500,
+ tuneLength = 5,
+ metric = "ROC")
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following object is masked from ‘package:stats’:
cov, smooth, var
Loading required package: class
> rfFit
Random Forest
6877 samples
503 predictors
2 classes: 'insurance', 'noinsurance'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6190, 6190, 6188, 6189, 6189, 6190, ...
Resampling results across tuning parameters:
mtry  ROC    Sens    Spec   Accuracy  Kappa      ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
2     0.608  0       1      0.94      0          0.0863  0        0        0.000422     0
7     0.669  0       1      0.94      -0.000285  0.0335  0        0.00049  0.000602     0.000901
31    0.689  0.0146  0.993  0.934     0.0134     0.0376  0.0171   0.00373  0.00447      0.0341
126   0.696  0.0292  0.986  0.928     0.0233     0.0387  0.0193   0.0042   0.00475      0.0338
502   0.688  0.0365  0.98   0.923     0.0233     0.042   0.0208   0.00392  0.00445      0.0335
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 126.
>
> evalResults$RF <- predict(rfFit, evaluationInd, type = "prob")[,1]
> testResults$RF <- predict(rfFit, testingInd, type = "prob")[,1]
> rfROC <- roc(evalResults$CARAVAN, evalResults$RF,
+ levels = rev(levels(evalResults$CARAVAN)))
> rfROC
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RF, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$RF in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7596
>
> rfEvalCM <- confusionMatrix(predict(rfFit, evaluationInd), evalResults$CARAVAN)
> rfEvalCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 4 9
noinsurance 55 915
Accuracy : 0.9349
95% CI : (0.9176, 0.9495)
No Information Rate : 0.94
P-Value [Acc > NIR] : 0.7727
Kappa : 0.0914
Mcnemar's Test P-Value : 1.855e-08
Sensitivity : 0.067797
Specificity : 0.990260
Pos Pred Value : 0.307692
Neg Pred Value : 0.943299
Prevalence : 0.060020
Detection Rate : 0.004069
Detection Prevalence : 0.013225
Balanced Accuracy : 0.529028
'Positive' Class : insurance
>
> set.seed(1410)
> lrFit <- train(CARAVAN ~ .,
+ data = trainingInd[, noNZVSet],
+ method = "glm",
+ trControl = ctrl,
+ metric = "ROC")
There were 20 warnings (use warnings() to see them)
> lrFit
Generalized Linear Model
6877 samples
203 predictors
2 classes: 'insurance', 'noinsurance'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6190, 6190, 6188, 6189, 6189, 6190, ...
Resampling results
ROC    Sens    Spec   Accuracy  Kappa   ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
0.702  0.0121  0.998  0.939     0.0179  0.0488  0.0128   0.0032   0.00323      0.0249
>
> evalResults$LogReg <- predict(lrFit, evaluationInd[, noNZVSet], type = "prob")[,1]
Warning messages:
1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
> testResults$LogReg <- predict(lrFit, testingInd[, noNZVSet], type = "prob")[,1]
Warning messages:
1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
> lrROC <- roc(evalResults$CARAVAN, evalResults$LogReg,
+ levels = rev(levels(evalResults$CARAVAN)))
> lrROC
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$LogReg, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$LogReg in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7267
>
> lrEvalCM <- confusionMatrix(predict(lrFit, evaluationInd), evalResults$CARAVAN)
Warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
> lrEvalCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 1 2
noinsurance 58 922
Accuracy : 0.939
95% CI : (0.9221, 0.9531)
No Information Rate : 0.94
P-Value [Acc > NIR] : 0.5872
Kappa : 0.0266
Mcnemar's Test P-Value : 1.243e-12
Sensitivity : 0.016949
Specificity : 0.997835
Pos Pred Value : 0.333333
Neg Pred Value : 0.940816
Prevalence : 0.060020
Detection Rate : 0.001017
Detection Prevalence : 0.003052
Balanced Accuracy : 0.507392
'Positive' Class : insurance
>
> set.seed(1401)
> fdaFit <- train(CARAVAN ~ ., data = training,
+ method = "fda",
+ tuneGrid = data.frame(degree = 1, nprune = 1:25),
+ metric = "ROC",
+ trControl = ctrl)
Loading required package: earth
Loading required package: leaps
Loading required package: plotmo
Loading required package: plotrix
Loading required package: mda
> fdaFit
Flexible Discriminant Analysis
6877 samples
85 predictors
2 classes: 'insurance', 'noinsurance'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ...
Resampling results across tuning parameters:
nprune  ROC    Sens    Spec   Accuracy  Kappa    ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
1       0.5    0       1      0.94      0        0       0        0        0.000452     0
2       0.664  0       1      0.94      0        0.0291  0        0        0.000452     0
3       0.691  0       0.999  0.94      -0.0011  0.0272  0        0.00149  0.0014       0.00265
4       0.705  0.0146  0.997  0.938     0.0201   0.0333  0.0171   0.00231  0.00222      0.0286
5       0.704  0.0146  0.997  0.938     0.0206   0.0303  0.0171   0.00251  0.00209      0.0281
6       0.723  0.0244  0.997  0.938     0.0358   0.0325  0.0304   0.00204  0.00268      0.0509
7       0.724  0.0268  0.995  0.937     0.035    0.0323  0.0372   0.00292  0.0023       0.0541
8       0.724  0.0268  0.995  0.937     0.0347   0.0316  0.0372   0.00311  0.0026       0.0544
9       0.728  0.0293  0.995  0.937     0.0383   0.0315  0.0378   0.0032   0.0026       0.0553
10      0.727  0.0317  0.994  0.936     0.0393   0.0339  0.0382   0.00482  0.00315      0.0509
11      0.73   0.0366  0.993  0.936     0.0475   0.0351  0.0368   0.00484  0.00346      0.0495
12      0.73   0.0415  0.992  0.936     0.0531   0.0325  0.0364   0.00452  0.00292      0.0481
13      0.734  0.0488  0.993  0.936     0.0651   0.0385  0.0398   0.00411  0.00295      0.0557
14      0.73   0.0488  0.992  0.935     0.0626   0.034   0.0415   0.004    0.00267      0.0575
15      0.732  0.0463  0.992  0.935     0.0599   0.0327  0.0422   0.00307  0.00267      0.0614
16      0.728  0.0537  0.991  0.935     0.0707   0.0356  0.0427   0.00311  0.00337      0.0637
17      0.732  0.0512  0.991  0.935     0.0647   0.0353  0.0437   0.00409  0.00339      0.0624
18      0.731  0.0512  0.991  0.935     0.0648   0.0362  0.0466   0.00398  0.00327      0.0652
19      0.729  0.0488  0.991  0.934     0.0597   0.0369  0.0488   0.00425  0.00331      0.0679
20      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399  0.00324      0.0685
21      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399  0.00324      0.0685
22      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399  0.00324      0.0685
23      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399  0.00324      0.0685
24      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399  0.00324      0.0685
25      0.727  0.0488  0.991  0.934     0.0599   0.0364  0.0488   0.00399  0.00324      0.0685
Tuning parameter 'degree' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were degree = 1 and nprune = 13.
>
> evalResults$FDA <- predict(fdaFit, evaluation[, predictors], type = "prob")[,1]
> testResults$FDA <- predict(fdaFit, testing[, predictors], type = "prob")[,1]
> fdaROC <- roc(evalResults$CARAVAN, evalResults$FDA,
+ levels = rev(levels(evalResults$CARAVAN)))
> fdaROC
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$FDA, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$FDA in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.754
>
> fdaEvalCM <- confusionMatrix(predict(fdaFit, evaluation[, predictors]), evalResults$CARAVAN)
> fdaEvalCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 1 3
noinsurance 58 921
Accuracy : 0.9379
95% CI : (0.921, 0.9522)
No Information Rate : 0.94
P-Value [Acc > NIR] : 0.638
Kappa : 0.0243
Mcnemar's Test P-Value : 4.712e-12
Sensitivity : 0.016949
Specificity : 0.996753
Pos Pred Value : 0.250000
Neg Pred Value : 0.940756
Prevalence : 0.060020
Detection Rate : 0.001017
Detection Prevalence : 0.004069
Balanced Accuracy : 0.506851
'Positive' Class : insurance
>
>
> labs <- c(RF = "Random Forest", LogReg = "Logistic Regression",
+ FDA = "FDA (MARS)")
> lift1 <- lift(CARAVAN ~ RF + LogReg + FDA, data = evalResults,
+ labels = labs)
>
> plotTheme <- caretTheme()
>
> plot(fdaROC, type = "S", col = plotTheme$superpose.line$col[3], legacy.axes = TRUE)
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$FDA, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$FDA in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.754
> plot(rfROC, type = "S", col = plotTheme$superpose.line$col[1], add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RF, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$RF in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7596
> plot(lrROC, type = "S", col = plotTheme$superpose.line$col[2], add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$LogReg, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$LogReg in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7267
> legend(.7, .25,
+ c("Random Forest", "Logistic Regression", "FDA (MARS)"),
+ cex = .85,
+ col = plotTheme$superpose.line$col[1:3],
+ lwd = rep(2, 3),
+ lty = rep(1, 3))
>
> xyplot(lift1,
+ ylab = "%Events Found",
+ xlab = "%Customers Evaluated",
+ lwd = 2,
+ type = "l")
>
>
> ################################################################################
> ### Section 16.4 Alternate Cutoffs
>
> rfThresh <- coords(rfROC, x = "best", ret="threshold",
+ best.method="closest.topleft")
> rfThreshY <- coords(rfROC, x = "best", ret="threshold",
+ best.method="youden")
>
> cutText <- ifelse(rfThresh == rfThreshY,
+ "is the same as",
+ "is similar to")
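`coords()` above is asked for the best threshold under two rules. Youden's J maximizes Sens + Spec − 1, while `closest.topleft` minimizes the squared distance to the perfect-classifier corner (0, 1) of the ROC plot; the two often, but not always, agree, which is what the `cutText` check probes. A base-R sketch on made-up ROC coordinates:

```r
# Two cutoff-selection rules on hypothetical (sens, spec, threshold) triples.
sens   <- c(.95, .80, .66, .40)
spec   <- c(.40, .60, .72, .90)
thresh <- c(.1, .2, .3, .5)
thresh[which.max(sens + spec - 1)]              # Youden's J picks 0.2
thresh[which.min((1 - sens)^2 + (1 - spec)^2)]  # closest.topleft picks 0.3
```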
>
> evalResults$rfAlt <- factor(ifelse(evalResults$RF > rfThresh,
+ "insurance", "noinsurance"),
+ levels = levels(evalResults$CARAVAN))
> testResults$rfAlt <- factor(ifelse(testResults$RF > rfThresh,
+ "insurance", "noinsurance"),
+ levels = levels(testResults$CARAVAN))
> rfAltEvalCM <- confusionMatrix(evalResults$rfAlt, evalResults$CARAVAN)
> rfAltEvalCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 39 257
noinsurance 20 667
Accuracy : 0.7182
95% CI : (0.689, 0.7462)
No Information Rate : 0.94
P-Value [Acc > NIR] : 1
Kappa : 0.1329
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.66102
Specificity : 0.72186
Pos Pred Value : 0.13176
Neg Pred Value : 0.97089
Prevalence : 0.06002
Detection Rate : 0.03967
Detection Prevalence : 0.30112
Balanced Accuracy : 0.69144
'Positive' Class : insurance
>
> rfAltTestCM <- confusionMatrix(testResults$rfAlt, testResults$CARAVAN)
> rfAltTestCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 71 467
noinsurance 45 1379
Accuracy : 0.739
95% CI : (0.719, 0.7584)
No Information Rate : 0.9409
P-Value [Acc > NIR] : 1
Kappa : 0.1328
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.61207
Specificity : 0.74702
Pos Pred Value : 0.13197
Neg Pred Value : 0.96840
Prevalence : 0.05912
Detection Rate : 0.03619
Detection Prevalence : 0.27421
Balanced Accuracy : 0.67954
'Positive' Class : insurance
>
> rfTestCM <- confusionMatrix(predict(rfFit, testingInd), testResults$CARAVAN)
>
>
> plot(rfROC, print.thres = c(.5, .3, .10, rfThresh), type = "S",
+ print.thres.pattern = "%.3f (Spec = %.2f, Sens = %.2f)",
+ print.thres.cex = .8, legacy.axes = TRUE)
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RF, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$RF in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7596
>
> ################################################################################
> ### Section 16.5 Adjusting Prior Probabilities
>
> priors <- table(ticdata$CARAVAN)/nrow(ticdata)*100
> fdaPriors <- fdaFit
> fdaPriors$finalModel$prior <- c(insurance = .6, noinsurance = .4)
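Resetting `fdaFit$finalModel$prior` shifts the discriminant model toward the minority class. The underlying arithmetic is Bayes' rule: a posterior probability computed under one prior can be rescaled to another. A base-R sketch with hypothetical probabilities (not the model's internals):

```r
# Rescale a posterior p from an old class prior to a new one via Bayes' rule.
adjust_prior <- function(p, old_prior, new_prior) {
  num <- p * new_prior / old_prior
  den <- num + (1 - p) * (1 - new_prior) / (1 - old_prior)
  num / den
}
# A 10% posterior under a 6% prior becomes ~0.72 under a 60% prior:
adjust_prior(0.10, old_prior = 0.06, new_prior = 0.60)
```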
> fdaPriorPred <- predict(fdaPriors, evaluation[,predictors])
> evalResults$FDAprior <- predict(fdaPriors, evaluation[,predictors], type = "prob")[,1]
> testResults$FDAprior <- predict(fdaPriors, testing[,predictors], type = "prob")[,1]
> fdaPriorCM <- confusionMatrix(fdaPriorPred, evaluation$CARAVAN)
> fdaPriorCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 42 306
noinsurance 17 618
Accuracy : 0.6714
95% CI : (0.6411, 0.7007)
No Information Rate : 0.94
P-Value [Acc > NIR] : 1
Kappa : 0.1156
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.71186
Specificity : 0.66883
Pos Pred Value : 0.12069
Neg Pred Value : 0.97323
Prevalence : 0.06002
Detection Rate : 0.04273
Detection Prevalence : 0.35402
Balanced Accuracy : 0.69035
'Positive' Class : insurance
>
> fdaPriorROC <- roc(testResults$CARAVAN, testResults$FDAprior,
+ levels = rev(levels(testResults$CARAVAN)))
> fdaPriorROC
Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$FDAprior, levels = rev(levels(testResults$CARAVAN)))
Data: testResults$FDAprior in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.7469
>
> ################################################################################
> ### Section 16.7 Sampling Methods
>
> set.seed(1237)
> downSampled <- downSample(trainingInd[, -ncol(trainingInd)], training$CARAVAN)
>
> set.seed(1237)
> upSampled <- upSample(trainingInd[, -ncol(trainingInd)], training$CARAVAN)
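`downSample()` and `upSample()` rebalance by sampling: the first draws the majority class down to the minority-class count, the second resamples the minority class up with replacement. A base-R sketch of the down-sampling idea on a toy class vector:

```r
# Down-sampling by hand: keep n_min cases from every class.
set.seed(1237)
y <- factor(rep(c("insurance", "noinsurance"), c(10, 90)))
n_min <- min(table(y))
keep <- unlist(lapply(split(seq_along(y), y), sample, size = n_min))
table(y[keep])  # 10 of each class: a balanced subset
```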
>
> library(DMwR)
Loading required package: xts
Loading required package: zoo
Attaching package: ‘zoo’
The following object is masked from ‘package:base’:
as.Date, as.Date.numeric
Loading required package: quantmod
Loading required package: Defaults
Loading required package: TTR
Version 0.4-0 included new data defaults. See ?getSymbols.
Loading required package: ROCR
Loading required package: gplots
Loading required package: gtools
Attaching package: ‘gtools’
The following object is masked from ‘package:e1071’:
permutations
Loading required package: gdata
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
Attaching package: ‘gdata’
The following object is masked from ‘package:randomForest’:
combine
The following object is masked from ‘package:stats’:
nobs
The following object is masked from ‘package:utils’:
object.size
Loading required package: caTools
Loading required package: grid
Loading required package: KernSmooth
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
Loading required package: MASS
Attaching package: ‘gplots’
The following object is masked from ‘package:plotrix’:
plotCI
The following object is masked from ‘package:stats’:
lowess
Loading required package: rpart
Loading required package: abind
Loading required package: cluster
Attaching package: ‘DMwR’
The following object is masked from ‘package:plyr’:
join
Warning message:
'.path.package' is deprecated.
Use 'path.package' instead.
See help("Deprecated")
> set.seed(1237)
> smoted <- SMOTE(CARAVAN ~ ., data = trainingInd)
>
> set.seed(1410)
> rfDown <- train(Class ~ ., data = downSampled,
+ "rf",
+ trControl = ctrl,
+ ntree = 1500,
+ tuneLength = 5,
+ metric = "ROC")
> rfDown
Random Forest
822 samples
503 predictors
2 classes: 'insurance', 'noinsurance'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 740, 740, 739, 739, 740, 740, ...
Resampling results across tuning parameters:
mtry  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
2     0.698  0.652  0.648  0.65      0.3    0.0724  0.0921   0.12     0.0764       0.152
7     0.682  0.608  0.677  0.642     0.285  0.0712  0.0715   0.1      0.064        0.128
31    0.69   0.623  0.662  0.642     0.285  0.0582  0.0719   0.079    0.0513       0.103
126   0.698  0.628  0.657  0.642     0.285  0.056   0.0655   0.0886   0.0489       0.0979
502   0.683  0.618  0.63   0.624     0.248  0.0575  0.0516   0.0818   0.0413       0.0827
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 126.
>
> evalResults$RFdown <- predict(rfDown, evaluationInd, type = "prob")[,1]
> testResults$RFdown <- predict(rfDown, testingInd, type = "prob")[,1]
> rfDownROC <- roc(evalResults$CARAVAN, evalResults$RFdown,
+ levels = rev(levels(evalResults$CARAVAN)))
> rfDownROC
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RFdown, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$RFdown in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7922
>
> set.seed(1401)
> rfDownInt <- train(CARAVAN ~ ., data = trainingInd,
+ "rf",
+ ntree = 1500,
+ tuneLength = 5,
+ strata = training$CARAVAN,
+ sampsize = rep(sum(training$CARAVAN == "insurance"), 2),
+ metric = "ROC",
+ trControl = ctrl)
> rfDownInt
Random Forest
6877 samples
503 predictors
2 classes: 'insurance', 'noinsurance'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ...
Resampling results across tuning parameters:
mtry  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
2     0.703  0.144  0.97   0.92      0.138  0.0353  0.0409   0.00587  0.00682      0.0535
7     0.704  0.424  0.835  0.81      0.133  0.0284  0.0737   0.0204   0.0176       0.0323
31    0.72   0.414  0.857  0.831     0.154  0.0286  0.0601   0.0188   0.0183       0.0401
126   0.722  0.424  0.841  0.816     0.14   0.0306  0.0667   0.0171   0.0167       0.0374
502   0.718  0.465  0.824  0.802     0.141  0.0356  0.0692   0.021    0.0201       0.0373
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 126.
>
> evalResults$RFdownInt <- predict(rfDownInt, evaluationInd, type = "prob")[,1]
> testResults$RFdownInt <- predict(rfDownInt, testingInd, type = "prob")[,1]
> rfDownIntRoc <- roc(evalResults$CARAVAN,
+ evalResults$RFdownInt,
+ levels = rev(levels(training$CARAVAN)))
> rfDownIntRoc
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RFdownInt, levels = rev(levels(training$CARAVAN)))
Data: evalResults$RFdownInt in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7962
>
> set.seed(1410)
> rfUp <- train(Class ~ ., data = upSampled,
+ "rf",
+ trControl = ctrl,
+ ntree = 1500,
+ tuneLength = 5,
+ metric = "ROC")
> rfUp
Random Forest
12932 samples
503 predictors
2 classes: 'insurance', 'noinsurance'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11640, 11638, 11639, 11638, 11638, 11640, ...
Resampling results across tuning parameters:
mtry  ROC    Sens   Spec   Accuracy  Kappa  ROC SD   Sens SD  Spec SD  Accuracy SD  Kappa SD
   2  0.865  0.836  0.731  0.783     0.567  0.0115   0.00971  0.0186   0.0112       0.0224
   7  0.987  0.992  0.861  0.927     0.853  0.00354  0.00375  0.0226   0.01         0.02
  31  0.993  0.999  0.938  0.968     0.937  0.00309  0.00167  0.0127   0.00668      0.0134
 126  0.992  1      0.95   0.975     0.95   0.00345  0        0.0103   0.00515      0.0103
 502  0.992  1      0.943  0.971     0.943  0.00379  0        0.0136   0.00681      0.0136
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 31.
>
> evalResults$RFup <- predict(rfUp, evaluationInd, type = "prob")[,1]
> testResults$RFup <- predict(rfUp, testingInd, type = "prob")[,1]
> rfUpROC <- roc(evalResults$CARAVAN, evalResults$RFup,
+ levels = rev(levels(evalResults$CARAVAN)))
> rfUpROC
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RFup, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$RFup in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7336
>
> set.seed(1410)
> rfSmote <- train(CARAVAN ~ ., data = smoted,
+ "rf",
+ trControl = ctrl,
+ ntree = 1500,
+ tuneLength = 5,
+ metric = "ROC")
> rfSmote
Random Forest
2877 samples
503 predictors
2 classes: 'insurance', 'noinsurance'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 2590, 2589, 2589, 2590, 2588, 2590, ...
Resampling results across tuning parameters:
mtry  ROC    Sens   Spec   Accuracy  Kappa  ROC SD  Sens SD  Spec SD  Accuracy SD  Kappa SD
   2  0.906  0.666  0.998  0.856     0.693  0.0215  0.0322   0.00409  0.0142       0.0314
   7  0.908  0.69   0.973  0.852     0.687  0.0177  0.0299   0.0241   0.0215       0.0451
  31  0.914  0.731  0.947  0.854     0.695  0.0168  0.0243   0.0223   0.0183       0.0378
 126  0.918  0.736  0.942  0.853     0.693  0.0146  0.0231   0.0208   0.0154       0.0319
 502  0.912  0.742  0.923  0.845     0.678  0.0151  0.0201   0.0306   0.0211       0.0428
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 126.
>
> evalResults$RFsmote <- predict(rfSmote, evaluationInd, type = "prob")[,1]
> testResults$RFsmote <- predict(rfSmote, testingInd, type = "prob")[,1]
> rfSmoteROC <- roc(evalResults$CARAVAN, evalResults$RFsmote,
+ levels = rev(levels(evalResults$CARAVAN)))
> rfSmoteROC
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$RFsmote, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$RFsmote in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.7675
>
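The `rfSmote` fit was trained on a SMOTE'd data set (2877 rows). SMOTE creates synthetic minority-class cases by interpolating between a minority point and one of its nearest minority-class neighbours. A minimal Python sketch of that interpolation step — not the `DMwR::SMOTE()` implementation, which also down-samples the majority class:

```python
import math
import random

def smote(minority, k=5, n_new=10, seed=1410):
    """Return n_new synthetic points, each a random convex combination
    of a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # k nearest neighbours of p among the other minority points
        neighbours = sorted(minority, key=lambda q: math.dist(p, q))[1:k + 1]
        q = rng.choice(neighbours)
        lam = rng.random()
        synthetic.append(tuple(pi + lam * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic
```

Because each synthetic point lies on a segment between two existing minority points, the new cases stay inside the minority class's convex hull.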
> rfSmoteCM <- confusionMatrix(predict(rfSmote, evaluationInd), evalResults$CARAVAN)
> rfSmoteCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 11 50
noinsurance 48 874
Accuracy : 0.9003
95% CI : (0.8799, 0.9183)
No Information Rate : 0.94
P-Value [Acc > NIR] : 1.0000
Kappa : 0.1303
Mcnemar's Test P-Value : 0.9195
Sensitivity : 0.18644
Specificity : 0.94589
Pos Pred Value : 0.18033
Neg Pred Value : 0.94794
Prevalence : 0.06002
Detection Rate : 0.01119
Detection Prevalence : 0.06205
Balanced Accuracy : 0.56616
'Positive' Class : insurance
>
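Kappa above is small (0.1303) even though accuracy is 0.90, because the no-information rate is 0.94: a model that always predicts `noinsurance` would do nearly as well. The statistic is easy to verify by hand from the 2x2 table (rows are predictions, columns the reference); a quick Python check:

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for the 2x2 table [[a, b], [c, d]]
    (rows = predicted class, columns = reference class)."""
    n = a + b + c + d
    po = (a + d) / n                                        # observed agreement (accuracy)
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # agreement expected by chance
    return (po - pe) / (1 - pe)
```

Plugging in the table above, `cohens_kappa(11, 50, 48, 874)` reproduces the printed value of about 0.1303.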
> samplingSummary <- function(x, evl, tst)
+ {
+ lvl <- rev(levels(tst$CARAVAN))
+ evlROC <- roc(evl$CARAVAN,
+ predict(x, evl, type = "prob")[,1],
+ levels = lvl)
+ rocs <- c(auc(evlROC),
+ auc(roc(tst$CARAVAN,
+ predict(x, tst, type = "prob")[,1],
+ levels = lvl)))
+ cut <- coords(evlROC, x = "best", ret="threshold",
+ best.method="closest.topleft")
+ bestVals <- coords(evlROC, cut, ret=c("sensitivity", "specificity"))
+ out <- c(rocs, bestVals*100)
+ names(out) <- c("evROC", "tsROC", "tsSens", "tsSpec")
+ out
+ }
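`samplingSummary()` picks its cutoff with `best.method = "closest.topleft"`: the threshold whose (sensitivity, specificity) point minimizes the squared distance to the top-left corner of the ROC plot. The criterion itself is one line; a hypothetical Python version over `(threshold, sens, spec)` triples:

```python
def closest_topleft(points):
    """Among (threshold, sensitivity, specificity) triples, return the
    one whose ROC point minimizes (1 - sens)^2 + (1 - spec)^2,
    i.e. lies closest to the plot's top-left corner."""
    return min(points, key=lambda t: (1 - t[1]) ** 2 + (1 - t[2]) ** 2)
```

pROC's other built-in criterion, `best.method = "youden"`, instead maximizes sens + spec - 1; the two can pick different thresholds on the same curve.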
>
> rfResults <- rbind(samplingSummary(rfFit, evaluationInd, testingInd),
+ samplingSummary(rfDown, evaluationInd, testingInd),
+ samplingSummary(rfDownInt, evaluationInd, testingInd),
+ samplingSummary(rfUp, evaluationInd, testingInd),
+ samplingSummary(rfSmote, evaluationInd, testingInd))
> rownames(rfResults) <- c("Original", "Down--Sampling", "Down--Sampling (Internal)",
+ "Up--Sampling", "SMOTE")
>
> rfResults
evROC tsROC tsSens tsSpec
Original 0.7596119 0.7360673 66.10169 72.18615
Down--Sampling 0.7921894 0.7291301 86.44068 67.74892
Down--Sampling (Internal) 0.7961516 0.7649158 66.10169 80.30303
Up--Sampling 0.7336195 0.7408283 72.88136 63.96104
SMOTE 0.7675178 0.7318643 81.35593 65.36797
>
> rocCols <- c("black", rgb(1, 0, 0, .5), rgb(0, 0, 1, .5))
>
> plot(roc(testResults$CARAVAN, testResults$RF, levels = rev(levels(testResults$CARAVAN))),
+ type = "S", col = rocCols[1], legacy.axes = TRUE)
Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$RF, levels = rev(levels(testResults$CARAVAN)))
Data: testResults$RF in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.7361
> plot(roc(testResults$CARAVAN, testResults$RFdownInt, levels = rev(levels(testResults$CARAVAN))),
+ type = "S", col = rocCols[2],add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$RFdownInt, levels = rev(levels(testResults$CARAVAN)))
Data: testResults$RFdownInt in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.7649
> plot(roc(testResults$CARAVAN, testResults$RFsmote, levels = rev(levels(testResults$CARAVAN))),
+ type = "S", col = rocCols[3], add = TRUE, legacy.axes = TRUE)
Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$RFsmote, levels = rev(levels(testResults$CARAVAN)))
Data: testResults$RFsmote in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.7319
> legend(.6, .4,
+ c("Normal", "Down-Sampling (Internal)", "SMOTE"),
+ lty = rep(1, 3),
+ lwd = rep(2, 3),
+ cex = .8,
+ col = rocCols)
>
> xyplot(lift(CARAVAN ~ RF + RFdownInt + RFsmote,
+ data = testResults),
+ type = "l",
+ ylab = "%Events Found",
+ xlab = "%Customers Evaluated")
>
>
> ################################################################################
> ### Section 16.8 Cost–Sensitive Training
>
> library(kernlab)
>
> set.seed(1157)
> sigma <- sigest(CARAVAN ~ ., data = trainingInd[, noNZVSet], frac = .75)
> names(sigma) <- NULL
>
> svmGrid1 <- data.frame(sigma = sigma[2],
+ C = 2^c(2:10))
>
> set.seed(1401)
> svmFit <- train(CARAVAN ~ .,
+ data = trainingInd[, noNZVSet],
+ method = "svmRadial",
+ tuneGrid = svmGrid1,
+ preProc = c("center", "scale"),
+ metric = "Kappa",
+ trControl = ctrl)
> svmFit
Support Vector Machines with Radial Basis Function Kernel
6877 samples
203 predictors
2 classes: 'insurance', 'noinsurance'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ...
Resampling results across tuning parameters:
C     ROC    Sens  Spec  Accuracy  Kappa      ROC SD  Sens SD  Spec SD   Accuracy SD  Kappa SD
   4  0.665  0     1     0.94      -0.000285  0.0465  0        0.000489  6e-04        9e-04
   8  0.671  0     1     0.94      0          0.0476  0        0         0.000452     0
  16  0.678  0     1     0.94      0          0.041   0        0         0.000452     0
  32  0.678  0     1     0.94      0          0.0368  0        0         0.000452     0
  64  0.668  0     1     0.94      0          0.0399  0        0         0.000452     0
 128  0.655  0     1     0.94      0          0.039   0        0         0.000452     0
 256  0.648  0     1     0.94      0          0.0395  0        0         0.000452     0
 512  0.644  0     1     0.94      0          0.0401  0        0         0.000452     0
1020  0.643  0     1     0.94      0          0.037   0        0         0.000452     0
Tuning parameter 'sigma' was held constant at a value of 0.002454182
Kappa was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.00245 and C = 8.
>
> evalResults$SVM <- predict(svmFit, evaluationInd[, noNZVSet], type = "prob")[,1]
> testResults$SVM <- predict(svmFit, testingInd[, noNZVSet], type = "prob")[,1]
> svmROC <- roc(evalResults$CARAVAN, evalResults$SVM,
+ levels = rev(levels(evalResults$CARAVAN)))
> svmROC
Call:
roc.default(response = evalResults$CARAVAN, predictor = evalResults$SVM, levels = rev(levels(evalResults$CARAVAN)))
Data: evalResults$SVM in 924 controls (evalResults$CARAVAN noinsurance) < 59 cases (evalResults$CARAVAN insurance).
Area under the curve: 0.6952
>
> svmTestROC <- roc(testResults$CARAVAN, testResults$SVM,
+ levels = rev(levels(testResults$CARAVAN)))
> svmTestROC
Call:
roc.default(response = testResults$CARAVAN, predictor = testResults$SVM, levels = rev(levels(testResults$CARAVAN)))
Data: testResults$SVM in 1846 controls (testResults$CARAVAN noinsurance) < 116 cases (testResults$CARAVAN insurance).
Area under the curve: 0.6974
>
> confusionMatrix(predict(svmFit, evaluationInd[, noNZVSet]), evalResults$CARAVAN)
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 0 0
noinsurance 59 924
Accuracy : 0.94
95% CI : (0.9233, 0.954)
No Information Rate : 0.94
P-Value [Acc > NIR] : 0.5346
Kappa : 0
Mcnemar's Test P-Value : 4.321e-14
Sensitivity : 0.00000
Specificity : 1.00000
Pos Pred Value : NaN
Neg Pred Value : 0.93998
Prevalence : 0.06002
Detection Rate : 0.00000
Detection Prevalence : 0.00000
Balanced Accuracy : 0.50000
'Positive' Class : insurance
>
> confusionMatrix(predict(svmFit, testingInd[, noNZVSet]), testingInd$CARAVAN)
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 0 0
noinsurance 116 1846
Accuracy : 0.9409
95% CI : (0.9295, 0.9509)
No Information Rate : 0.9409
P-Value [Acc > NIR] : 0.5247
Kappa : 0
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.00000
Specificity : 1.00000
Pos Pred Value : NaN
Neg Pred Value : 0.94088
Prevalence : 0.05912
Detection Rate : 0.00000
Detection Prevalence : 0.00000
Balanced Accuracy : 0.50000
'Positive' Class : insurance
>
>
> set.seed(1401)
> svmWtFit <- train(CARAVAN ~ .,
+ data = trainingInd[, noNZVSet],
+ method = "svmRadial",
+ tuneGrid = svmGrid1,
+ preProc = c("center", "scale"),
+ metric = "Kappa",
+ class.weights = c(insurance = 18, noinsurance = 1),
+ trControl = ctrlNoProb)
> svmWtFit
Support Vector Machines with Radial Basis Function Kernel
6877 samples
203 predictors
2 classes: 'insurance', 'noinsurance'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ...
Resampling results across tuning parameters:
C Accuracy Kappa Sens Spec Accuracy SD Kappa SD Sens SD Spec SD
4 0.818 0.105 0.343 0.849 0.016 0.0399 0.0853 0.0186
8 0.842 0.116 0.309 0.876 0.0142 0.0339 0.0605 0.0159
16 0.855 0.105 0.256 0.893 0.0192 0.0442 0.0602 0.0207
32 0.869 0.11 0.234 0.909 0.0159 0.0507 0.0633 0.0167
64 0.876 0.0948 0.195 0.919 0.0173 0.0426 0.0435 0.0179
128 0.879 0.0865 0.175 0.924 0.0155 0.049 0.0503 0.0155
256 0.88 0.0843 0.17 0.925 0.0154 0.0419 0.0386 0.0154
512 0.879 0.0739 0.161 0.925 0.015 0.0501 0.0557 0.0157
1020 0.88 0.073 0.158 0.925 0.0148 0.0511 0.0569 0.0154
Tuning parameter 'sigma' was held constant at a value of 0.002454182
Kappa was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.00245 and C = 8.
>
> svmWtEvalCM <- confusionMatrix(predict(svmWtFit, evaluationInd[, noNZVSet]), evalResults$CARAVAN)
> svmWtEvalCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 17 123
noinsurance 42 801
Accuracy : 0.8321
95% CI : (0.8073, 0.855)
No Information Rate : 0.94
P-Value [Acc > NIR] : 1
Kappa : 0.0944
Mcnemar's Test P-Value : 4.725e-10
Sensitivity : 0.28814
Specificity : 0.86688
Pos Pred Value : 0.12143
Neg Pred Value : 0.95018
Prevalence : 0.06002
Detection Rate : 0.01729
Detection Prevalence : 0.14242
Balanced Accuracy : 0.57751
'Positive' Class : insurance
>
> svmWtTestCM <- confusionMatrix(predict(svmWtFit, testingInd[, noNZVSet]), testingInd$CARAVAN)
> svmWtTestCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 40 223
noinsurance 76 1623
Accuracy : 0.8476
95% CI : (0.8309, 0.8632)
No Information Rate : 0.9409
P-Value [Acc > NIR] : 1
Kappa : 0.1406
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.34483
Specificity : 0.87920
Pos Pred Value : 0.15209
Neg Pred Value : 0.95527
Prevalence : 0.05912
Detection Rate : 0.02039
Detection Prevalence : 0.13405
Balanced Accuracy : 0.61201
'Positive' Class : insurance
>
>
> initialRpart <- rpart(CARAVAN ~ ., data = training,
+ control = rpart.control(cp = 0.0001))
> rpartGrid <- data.frame(cp = initialRpart$cptable[, "CP"])
>
> cmat <- list(loss = matrix(c(0, 1, 20, 0), ncol = 2))
> set.seed(1401)
> cartWMod <- train(x = training[,predictors],
+ y = training$CARAVAN,
+ method = "rpart",
+ trControl = ctrlNoProb,
+ tuneGrid = rpartGrid,
+ metric = "Kappa",
+ parms = cmat)
> cartWMod
CART
6877 samples
85 predictors
2 classes: 'insurance', 'noinsurance'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 6189, 6190, 6190, 6189, 6189, 6189, ...
Resampling results across tuning parameters:
cp        Accuracy  Kappa   Sens   Spec   Accuracy SD  Kappa SD  Sens SD  Spec SD
1e-04     0.797     0.0734  0.316  0.828  0.018        0.0435    0.0918   0.0203
0.000487  0.797     0.0744  0.319  0.827  0.0189       0.0423    0.0892   0.0212
0.00122   0.778     0.0768  0.36   0.805  0.02         0.037     0.0883   0.0229
0.00162   0.762     0.0844  0.411  0.785  0.0181       0.0298    0.0794   0.0208
0.00243   0.722     0.0805  0.48   0.737  0.024        0.0253    0.0786   0.0274
0.00278   0.707     0.0773  0.499  0.72   0.0229       0.0299    0.0916   0.0256
Kappa was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.00162.
>
>
> library(C50)
> c5Grid <- expand.grid(model = c("tree", "rules"),
+ trials = c(1, (1:10)*10),
+ winnow = FALSE)
>
> finalCost <- matrix(c(0, 20, 1, 0), ncol = 2)
> rownames(finalCost) <- colnames(finalCost) <- levels(training$CARAVAN)
> set.seed(1401)
> C5CostFit <- train(training[, predictors],
+ training$CARAVAN,
+ method = "C5.0",
+ metric = "Kappa",
+ tuneGrid = c5Grid,
+ cost = finalCost,
+ control = C5.0Control(earlyStopping = FALSE),
+ trControl = ctrlNoProb)
>
> C5CostCM <- confusionMatrix(predict(C5CostFit, testing), testing$CARAVAN)
> C5CostCM
Confusion Matrix and Statistics
Reference
Prediction insurance noinsurance
insurance 64 623
noinsurance 52 1223
Accuracy : 0.656
95% CI : (0.6345, 0.677)
No Information Rate : 0.9409
P-Value [Acc > NIR] : 1
Kappa : 0.0648
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.55172
Specificity : 0.66251
Pos Pred Value : 0.09316
Neg Pred Value : 0.95922
Prevalence : 0.05912
Detection Rate : 0.03262
Detection Prevalence : 0.35015
Balanced Accuracy : 0.60712
'Positive' Class : insurance
>
>
> ################################################################################
> ### Session Information
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] C50_0.1.0-14 kernlab_0.9-16 DMwR_0.3.0 cluster_1.14.4
[5] abind_1.4-0 rpart_4.1-1 ROCR_1.0-4 gplots_2.11.0
[9] MASS_7.3-26 KernSmooth_2.23-10 caTools_1.14 gdata_2.12.0
[13] gtools_2.7.0 quantmod_0.4-0 TTR_0.21-1 Defaults_1.1-1
[17] xts_0.9-3 zoo_1.7-9 mda_0.4-2 earth_3.2-3
[21] plotrix_3.4-6 plotmo_1.3-2 leaps_2.9 e1071_1.6-1
[25] class_7.3-7 pROC_1.5.4 plyr_1.8 randomForest_4.6-7
[29] caret_6.0-22 ggplot2_0.9.3.1 DWD_0.10 Matrix_1.0-12
[33] lattice_0.20-15
loaded via a namespace (and not attached):
[1] bitops_1.0-5 car_2.0-16 codetools_0.2-8 colorspace_1.2-1
[5] compiler_3.0.1 dichromat_2.0-0 digest_0.6.3 foreach_1.4.0
[9] gtable_0.1.2 iterators_1.0.6 labeling_0.1 munsell_0.4
[13] proto_0.3-10 RColorBrewer_1.0-5 reshape2_1.2.2 scales_0.2.3
[17] stringr_0.6.2 tools_3.0.1
>
> q("no")
> proc.time()
user system elapsed
243437.520 682.066 244138.032
%%R -w 600 -h 600
## runChapterScript(16)
## user system elapsed
## 243437.520 682.066 244138.032
NULL
%%R
showChapterScript(17)
NULL
%%R
showChapterOutput(17)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 17: Case Study: Job Scheduling
> ###
> ### Required packages: AppliedPredictiveModeling, C50, caret, doMC (optional),
> ### earth, Hmisc, ipred, tabplot, kernlab, lattice, MASS,
> ### mda, nnet, pls, randomForest, rpart, sparseLDA,
> ###
> ### Data used: The HPC job scheduling data in the AppliedPredictiveModeling
> ### package.
> ###
> ### Notes:
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing sections. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be
> ### syntax differences that occur over time as packages evolve. These files
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may vary slightly.
> ###
> ################################################################################
>
> library(AppliedPredictiveModeling)
> data(schedulingData)
>
> ### Make a vector of predictor names
> predictors <- names(schedulingData)[!(names(schedulingData) %in% c("Class"))]
>
> ### A few summaries and plots of the data
> library(Hmisc)
Loading required package: survival
Loading required package: splines
Hmisc library by Frank E Harrell Jr
Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview')
to see overall documentation.
NOTE:Hmisc no longer redefines [.factor to drop unused levels when
subsetting. To get the old behavior of Hmisc type dropUnusedLevels().
Attaching package: ‘Hmisc’
The following object is masked from ‘package:survival’:
untangle.specials
The following object is masked from ‘package:base’:
format.pval, round.POSIXt, trunc.POSIXt, units
> describe(schedulingData)
schedulingData
8 Variables 4331 Observations
--------------------------------------------------------------------------------
Protocol
n missing unique
4331 0 14
A C D E F G H I J K L M N O
Frequency 94 160 149 96 170 155 321 381 989 6 242 451 536 581
% 2 4 3 2 4 4 7 9 23 0 6 10 12 13
--------------------------------------------------------------------------------
Compounds
n missing unique Mean .05 .10 .25 .50 .75 .90
4331 0 858 497.7 27 37 98 226 448 967
.95
2512
lowest : 20 21 22 23 24, highest: 14087 14090 14091 14097 14103
--------------------------------------------------------------------------------
InputFields
n missing unique Mean .05 .10 .25 .50 .75 .90
4331 0 1730 1537 26 48 134 426 991 4165
.95
7594
lowest : 10 11 12 13 14, highest: 36021 45420 45628 55920 56671
--------------------------------------------------------------------------------
Iterations
n missing unique Mean .05 .10 .25 .50 .75 .90
4331 0 11 29.24 10 20 20 20 20 50
.95
100
10 11 15 20 30 40 50 100 125 150 200
Frequency 272 9 2 3568 3 7 153 188 1 2 126
% 6 0 0 82 0 0 4 4 0 0 3
--------------------------------------------------------------------------------
NumPending
n missing unique Mean .05 .10 .25 .50 .75 .90
4331 0 303 53.39 0.0 0.0 0.0 0.0 0.0 33.0
.95
145.5
lowest : 0 1 2 3 4, highest: 3822 3870 3878 5547 5605
--------------------------------------------------------------------------------
Hour
n missing unique Mean .05 .10 .25 .50 .75 .90
4331 0 924 13.73 7.025 9.333 10.900 14.017 16.600 18.250
.95
19.658
lowest : 0.01667 0.03333 0.08333 0.10000 0.11667
highest: 23.20000 23.21667 23.35000 23.80000 23.98333
--------------------------------------------------------------------------------
Day
n missing unique
4331 0 7
Mon Tue Wed Thu Fri Sat Sun
Frequency 692 900 903 720 923 32 161
% 16 21 21 17 21 1 4
--------------------------------------------------------------------------------
Class
n missing unique
4331 0 4
VF (2211, 51%), F (1347, 31%), M (514, 12%), L (259, 6%)
--------------------------------------------------------------------------------
>
> library(tabplot)
Loading required package: ffbase
Loading required package: ff
Loading required package: tools
Loading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit
Attaching package: ‘bit’
The following object is masked from ‘package:base’:
xor
Attaching package ff
- getOption("fftempdir")=="/var/folders/Zf/ZfjbGEqKH2GPlbqofbYnBU+++TI/-Tmp-//RtmpZwCCTR"
- getOption("ffextension")=="ff"
- getOption("ffdrop")==TRUE
- getOption("fffinonexit")==TRUE
- getOption("ffpagesize")==65536
- getOption("ffcaching")=="mmnoflush" -- consider "ffeachflush" if your system stalls on large writes
- getOption("ffbatchbytes")==16777216 -- consider a different value for tuning your system
- getOption("ffmaxbytes")==536870912 -- consider a different value for tuning your system
Attaching package: ‘ff’
The following object is masked from ‘package:utils’:
write.csv, write.csv2
The following object is masked from ‘package:base’:
is.factor, is.ordered
Attaching package: ‘ffbase’
The following object is masked from ‘package:base’:
%in%
Loading required package: grid
> tableplot(schedulingData[, c( "Class", predictors)])
>
> mosaicplot(table(schedulingData$Protocol,
+ schedulingData$Class),
+ main = "")
>
> library(lattice)
> xyplot(Compounds ~ InputFields|Protocol,
+ data = schedulingData,
+ scales = list(x = list(log = 10), y = list(log = 10)),
+ groups = Class,
+ xlab = "Input Fields",
+ auto.key = list(columns = 4),
+ aspect = 1,
+ as.table = TRUE)
>
>
> ################################################################################
> ### Section 17.1 Data Splitting and Model Strategy
>
> ## Split the data
>
> library(caret)
Loading required package: ggplot2
Attaching package: ‘caret’
The following object is masked from ‘package:survival’:
cluster
> set.seed(1104)
> inTrain <- createDataPartition(schedulingData$Class, p = .8, list = FALSE)
>
> ### There are a lot of zeros and the distribution is skewed. We add
> ### one so that we can log transform the data
> schedulingData$NumPending <- schedulingData$NumPending + 1
>
> trainData <- schedulingData[ inTrain,]
> testData <- schedulingData[-inTrain,]
>
> ### Create a main effects only model formula to use
> ### repeatedly. Another formula with nonlinear effects is created
> ### below.
> modForm <- as.formula(Class ~ Protocol + log10(Compounds) +
+ log10(InputFields)+ log10(Iterations) +
+ log10(NumPending) + Hour + Day)
>
> ### Create an expanded set of predictors with interactions.
>
> modForm2 <- as.formula(Class ~ (Protocol + log10(Compounds) +
+ log10(InputFields)+ log10(Iterations) +
+ log10(NumPending) + Hour + Day)^2)
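`modForm2` wraps the main effects in `(...)^2`, which expands to all main effects plus all pairwise interactions. For numeric predictors an interaction column is just the elementwise product of two main-effect columns; the toy Python function below shows the expansion for a single row of numeric predictors (factor predictors such as `Protocol` are first turned into contrast columns by `model.matrix()`, which this sketch ignores):

```python
from itertools import combinations

def pairwise_interactions(row):
    """Expand a dict of numeric main-effect values with all pairwise
    products, mimicking what a ()^2 formula adds for numeric terms."""
    names = sorted(row)
    expanded = dict(row)
    for a, b in combinations(names, 2):
        expanded[f"{a}:{b}"] = row[a] * row[b]
    return expanded
```

With p main effects this yields p + p(p-1)/2 columns, which is why the expanded training set below has 112 predictors after degenerate columns are removed.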
>
>
> ### Some of these terms will not be estimable. For example, if there
> ### are no data points where a particular protocol was run on a
> ### particular day, the full interaction cannot be computed. We use
> ### model.matrix() to create the whole set of predictor columns, then
> ### remove those that are zero variance
>
> expandedTrain <- model.matrix(modForm2, data = trainData)
> expandedTest <- model.matrix(modForm2, data = testData)
> expandedTrain <- as.data.frame(expandedTrain)
> expandedTest <- as.data.frame(expandedTest)
>
> ### Some models have issues when there is a zero variance predictor
> ### within the data of a particular class, so we used caret's
> ### checkConditionalX() function to find the offending columns and
> ### remove them
>
> zv <- checkConditionalX(expandedTrain, trainData$Class)
>
> ### Keep the expanded set to use for models where we must manually add
> ### more complex terms (such as logistic regression)
>
> expandedTrain <- expandedTrain[,-zv]
> expandedTest <- expandedTest[, -zv]
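The degenerate case that `checkConditionalX()` reports is a predictor column that takes a single value within some class, which breaks class-conditional models such as LDA. A sketch of that check in Python — illustrative only, not caret's implementation:

```python
def conditionally_constant(columns, labels):
    """Return indices of columns that take a single value within at
    least one class (the situation checkConditionalX() flags)."""
    flagged = []
    for j, col in enumerate(columns):
        for cls in set(labels):
            vals = {v for v, l in zip(col, labels) if l == cls}
            if len(vals) <= 1:
                flagged.append(j)
                break
    return flagged
```

As in the R code above, the flagged indices are then dropped from both the training and test copies of the expanded data.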
>
> ### Create the cost matrix
> costMatrix <- ifelse(diag(4) == 1, 0, 1)
> costMatrix[4, 1] <- 10
> costMatrix[3, 1] <- 5
> costMatrix[4, 2] <- 5
> costMatrix[3, 2] <- 5
> rownames(costMatrix) <- colnames(costMatrix) <- levels(trainData$Class)
>
> ### Create a cost function
> cost <- function(pred, obs)
+ {
+ isNA <- is.na(pred)
+ if(!all(isNA))
+ {
+ pred <- pred[!isNA]
+ obs <- obs[!isNA]
+
+ cost <- ifelse(pred == obs, 0, 1)
+ if(any(pred == "VF" & obs == "L")) cost[pred == "L" & obs == "VF"] <- 10
+ if(any(pred == "F" & obs == "L")) cost[pred == "F" & obs == "L"] <- 5
+ if(any(pred == "F" & obs == "M")) cost[pred == "F" & obs == "M"] <- 5
+ if(any(pred == "VF" & obs == "M")) cost[pred == "VF" & obs == "M"] <- 5
+ out <- mean(cost)
+ } else out <- NA
+ out
+ }
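The `cost()` helper turns the asymmetric penalties of `costMatrix` into a resampling metric: an ordinary miss costs 1, but the severe errors (e.g. predicting VF when the job is actually L) cost 5 or 10. A compact Python equivalent, where the dict keys are hypothetical `(predicted, observed)` pairs mirroring the nonzero off-diagonal entries above:

```python
def avg_cost(pred, obs, cost):
    """Mean misclassification cost: cost[(p, o)] is the penalty for
    predicting p when the truth is o; any other miss costs 1."""
    total = 0.0
    for p, o in zip(pred, obs):
        total += 0 if p == o else cost.get((p, o), 1)
    return total / len(pred)
```

For example, one VF-for-L error among four predictions averages to (10 + 0 + 0 + 0) / 4 = 2.5.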
>
> ### Make a summary function that can be used with caret's train() function
> costSummary <- function (data, lev = NULL, model = NULL)
+ {
+ if (is.character(data$obs)) data$obs <- factor(data$obs, levels = lev)
+ c(postResample(data[, "pred"], data[, "obs"]),
+ Cost = cost(data[, "pred"], data[, "obs"]))
+ }
>
> ### Create a control object for the models
> ctrl <- trainControl(method = "repeatedcv",
+ repeats = 5,
+ summaryFunction = costSummary)
>
> ### Optional: parallel processing can be used via the 'do' packages,
> ### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
> ### up the computations.
>
> ### WARNING: Be aware of how much memory is needed to parallel
> ### process. It can very quickly overwhelm the available hardware. The
> ### estimate of the median memory usage (VSIZE = total memory size)
> ### was 3300-4100M per core, although some calculations require as
> ### much as 3400M without parallel processing.
>
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(14)
>
> ### Fit the CART model with and without costs
>
> set.seed(857)
> rpFit <- train(x = trainData[, predictors],
+ y = trainData$Class,
+ method = "rpart",
+ metric = "Cost",
+ maximize = FALSE,
+ tuneLength = 20,
+ trControl = ctrl)
Loading required package: rpart
Loading required package: class
Attaching package: ‘e1071’
The following object is masked from ‘package:Hmisc’:
impute
> rpFit
CART
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
cp Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
0.00236 0.774 0.631 0.51 0.0193 0.0323 0.0617
0.00249 0.773 0.63 0.514 0.0193 0.0319 0.0591
0.00294 0.768 0.621 0.537 0.0176 0.0305 0.0514
0.00324 0.766 0.617 0.542 0.0169 0.0298 0.0521
0.00353 0.764 0.611 0.55 0.017 0.03 0.0491
0.00383 0.762 0.607 0.56 0.0182 0.0321 0.0538
0.00471 0.76 0.603 0.569 0.0193 0.0345 0.0607
0.0053 0.758 0.597 0.58 0.0183 0.0326 0.0567
0.00589 0.756 0.594 0.585 0.0201 0.0355 0.0591
0.00648 0.751 0.586 0.604 0.0205 0.036 0.059
0.00824 0.735 0.558 0.647 0.0184 0.0327 0.0491
0.00942 0.727 0.544 0.663 0.0184 0.0328 0.0476
0.00982 0.723 0.539 0.667 0.0181 0.0325 0.047
0.01 0.719 0.532 0.67 0.0175 0.0317 0.0454
0.0159 0.703 0.505 0.697 0.0192 0.0327 0.0518
0.0171 0.698 0.495 0.717 0.0179 0.032 0.0586
0.0183 0.693 0.482 0.755 0.0208 0.0409 0.0797
0.0205 0.67 0.42 0.871 0.0227 0.0493 0.0626
0.0383 0.652 0.376 0.969 0.0177 0.0346 0.0517
0.274 0.568 0.159 0.992 0.0609 0.168 0.0323
Cost was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.00236.
>
> set.seed(857)
> rpFitCost <- train(x = trainData[, predictors],
+ y = trainData$Class,
+ method = "rpart",
+ metric = "Cost",
+ maximize = FALSE,
+ tuneLength = 20,
+ parms =list(loss = costMatrix),
+ trControl = ctrl)
> rpFitCost
CART
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
cp Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
0.00236 0.72 0.565 0.343 0.0161 0.0248 0.0325
0.00249 0.718 0.562 0.344 0.0162 0.0248 0.0336
0.00294 0.717 0.56 0.344 0.0186 0.0277 0.0349
0.00324 0.717 0.56 0.345 0.0182 0.0272 0.0344
0.00353 0.713 0.555 0.35 0.0197 0.0293 0.0362
0.00383 0.707 0.545 0.358 0.0201 0.0297 0.038
0.00471 0.699 0.533 0.366 0.0205 0.0297 0.0386
0.0053 0.685 0.513 0.381 0.0196 0.0281 0.0376
0.00589 0.675 0.501 0.392 0.0207 0.0288 0.0378
0.00648 0.656 0.479 0.403 0.0372 0.0482 0.0461
0.00824 0.63 0.449 0.428 0.0451 0.0555 0.0476
0.00942 0.623 0.44 0.436 0.0574 0.0687 0.0478
0.00982 0.62 0.436 0.443 0.0581 0.0697 0.0457
0.01 0.617 0.433 0.445 0.0583 0.0699 0.0436
0.0159 0.53 0.324 0.507 0.0257 0.0303 0.0312
0.0171 0.52 0.306 0.526 0.0201 0.0223 0.0276
0.0183 0.521 0.305 0.527 0.0194 0.0219 0.0277
0.0205 0.515 0.295 0.532 0.0187 0.0231 0.0299
0.0383 0.503 0.275 0.546 0.0161 0.0179 0.0269
0.274 0.119 0 0.881 0.00104 0 0.00104
Cost was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.00236.
>
> set.seed(857)
> ldaFit <- train(x = expandedTrain,
+ y = trainData$Class,
+ method = "lda",
+ metric = "Cost",
+ maximize = FALSE,
+ trControl = ctrl)
Loading required package: MASS
> ldaFit
Linear Discriminant Analysis
3467 samples
112 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results
Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
0.756 0.602 0.523 0.0232 0.0389 0.0495
>
> sldaGrid <- expand.grid(NumVars = seq(2, 112, by = 5),
+ lambda = c(0, 0.01, .1, 1, 10))
> set.seed(857)
> sldaFit <- train(x = expandedTrain,
+ y = trainData$Class,
+ method = "sparseLDA",
+ tuneGrid = sldaGrid,
+ preProc = c("center", "scale"),
+ metric = "Cost",
+ maximize = FALSE,
+ trControl = ctrl)
Loading required package: sparseLDA
Loading required package: lars
Loaded lars 1.2
Loading required package: elasticnet
Loading required package: mda
> sldaFit
Sparse Linear Discriminant Analysis
3467 samples
112 predictors
4 classes: 'VF', 'F', 'M', 'L'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
NumVars lambda Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
2 0 0.662 0.416 0.692 0.018 0.0326 0.0585
2 0.01 0.663 0.416 0.692 0.0179 0.0331 0.058
2 0.1 0.663 0.417 0.691 0.0169 0.0311 0.0573
2 1 0.662 0.416 0.693 0.0181 0.0327 0.0602
2 10 0.664 0.417 0.691 0.0164 0.0306 0.0547
7 0 0.681 0.457 0.707 0.0187 0.0333 0.0512
7 0.01 0.681 0.457 0.707 0.0187 0.0333 0.0512
7 0.1 0.681 0.457 0.707 0.0187 0.0333 0.0512
7 1 0.681 0.457 0.707 0.0188 0.0334 0.0512
7 10 0.681 0.457 0.707 0.0193 0.0341 0.0503
12 0 0.688 0.47 0.687 0.0181 0.0324 0.0526
12 0.01 0.688 0.47 0.687 0.0181 0.0324 0.0526
12 0.1 0.688 0.471 0.686 0.0182 0.0325 0.0524
12 1 0.688 0.47 0.687 0.018 0.0321 0.0522
12 10 0.687 0.469 0.689 0.0183 0.0326 0.0516
17 0 0.694 0.482 0.661 0.0178 0.0316 0.0516
17 0.01 0.694 0.483 0.661 0.0178 0.0317 0.0517
17 0.1 0.694 0.483 0.661 0.0181 0.032 0.0519
17 1 0.694 0.483 0.66 0.0176 0.0313 0.0512
17 10 0.693 0.482 0.662 0.0175 0.0312 0.0491
22 0 0.699 0.493 0.651 0.0187 0.0323 0.0487
22 0.01 0.699 0.493 0.651 0.0187 0.0323 0.0488
22 0.1 0.699 0.493 0.651 0.0187 0.0323 0.0487
22 1 0.699 0.493 0.651 0.0187 0.0323 0.0491
22 10 0.698 0.491 0.652 0.0185 0.032 0.0501
27 0 0.704 0.502 0.638 0.0195 0.0342 0.0578
27 0.01 0.704 0.503 0.637 0.0194 0.034 0.0574
27 0.1 0.704 0.503 0.638 0.0194 0.034 0.0578
27 1 0.704 0.503 0.638 0.0197 0.0345 0.0584
27 10 0.703 0.501 0.636 0.0199 0.0347 0.0592
32 0 0.712 0.518 0.626 0.0191 0.0336 0.0572
32 0.01 0.712 0.518 0.625 0.0191 0.0336 0.0572
32 0.1 0.712 0.518 0.625 0.0191 0.0336 0.0571
32 1 0.712 0.518 0.626 0.0191 0.0335 0.057
32 10 0.71 0.515 0.627 0.0193 0.0337 0.0566
37 0 0.721 0.536 0.611 0.0187 0.0322 0.0538
37 0.01 0.721 0.536 0.611 0.0187 0.0322 0.0538
37 0.1 0.721 0.536 0.611 0.0189 0.0324 0.0541
37 1 0.721 0.536 0.611 0.0187 0.0321 0.0532
37 10 0.717 0.529 0.615 0.0197 0.0339 0.0574
42 0 0.725 0.544 0.596 0.0186 0.0314 0.0508
42 0.01 0.725 0.544 0.596 0.0186 0.0315 0.0507
42 0.1 0.725 0.544 0.596 0.0185 0.0313 0.0506
42 1 0.725 0.544 0.595 0.0183 0.0311 0.0519
42 10 0.723 0.541 0.598 0.0203 0.0344 0.0522
47 0 0.727 0.548 0.578 0.0196 0.0325 0.0478
47 0.01 0.727 0.548 0.579 0.0193 0.0322 0.0486
47 0.1 0.727 0.548 0.579 0.0195 0.0325 0.0487
47 1 0.727 0.548 0.579 0.0194 0.0324 0.0491
47 10 0.725 0.546 0.584 0.0203 0.0336 0.0515
52 0 0.727 0.549 0.577 0.0206 0.0344 0.0476
52 0.01 0.727 0.549 0.577 0.0206 0.0344 0.0476
52 0.1 0.727 0.549 0.577 0.0205 0.0342 0.0475
52 1 0.727 0.548 0.577 0.021 0.0351 0.0483
52 10 0.725 0.546 0.579 0.0205 0.034 0.0495
57 0 0.73 0.553 0.573 0.0208 0.0348 0.0463
57 0.01 0.729 0.553 0.573 0.021 0.0351 0.0463
57 0.1 0.729 0.553 0.573 0.0209 0.035 0.0463
57 1 0.729 0.553 0.573 0.021 0.035 0.0455
57 10 0.728 0.551 0.574 0.021 0.0348 0.0474
62 0 0.736 0.565 0.56 0.0215 0.0359 0.0475
62 0.01 0.736 0.565 0.56 0.0215 0.0359 0.0475
62 0.1 0.736 0.565 0.56 0.0214 0.0357 0.0475
62 1 0.736 0.565 0.56 0.0211 0.0352 0.0475
62 10 0.733 0.56 0.563 0.021 0.0351 0.0485
67 0 0.742 0.576 0.549 0.0208 0.0344 0.0431
67 0.01 0.743 0.576 0.549 0.0208 0.0346 0.0432
67 0.1 0.743 0.576 0.549 0.0208 0.0345 0.0432
67 1 0.743 0.577 0.547 0.0212 0.0351 0.0449
67 10 0.739 0.57 0.553 0.0205 0.034 0.0452
72 0 0.747 0.585 0.539 0.0207 0.0346 0.0456
72 0.01 0.747 0.585 0.539 0.0207 0.0346 0.0456
72 0.1 0.747 0.585 0.539 0.0206 0.0344 0.0454
72 1 0.747 0.584 0.54 0.0205 0.0343 0.0447
72 10 0.743 0.578 0.546 0.0204 0.034 0.0432
77 0 0.751 0.591 0.534 0.0207 0.0347 0.042
77 0.01 0.751 0.591 0.534 0.0207 0.0347 0.042
77 0.1 0.751 0.591 0.534 0.0208 0.0348 0.0421
77 1 0.75 0.589 0.535 0.0213 0.0358 0.0429
77 10 0.747 0.584 0.54 0.0207 0.0345 0.0424
82 0 0.753 0.595 0.529 0.0196 0.0326 0.0409
82 0.01 0.753 0.595 0.529 0.0196 0.0326 0.041
82 0.1 0.753 0.595 0.529 0.0196 0.0326 0.0404
82 1 0.753 0.594 0.53 0.0199 0.0331 0.0399
82 10 0.748 0.586 0.537 0.0215 0.0359 0.0418
87 0 0.755 0.598 0.526 0.0202 0.0336 0.0428
87 0.01 0.755 0.598 0.526 0.0202 0.0336 0.0428
87 0.1 0.755 0.598 0.525 0.0203 0.0339 0.043
87 1 0.755 0.598 0.526 0.0202 0.0336 0.0412
87 10 0.75 0.59 0.532 0.0207 0.0347 0.0404
92 0 0.754 0.598 0.526 0.0214 0.0355 0.0451
92 0.01 0.754 0.598 0.527 0.0215 0.0357 0.045
92 0.1 0.755 0.598 0.526 0.0216 0.036 0.0452
92 1 0.754 0.598 0.526 0.0207 0.0345 0.0452
92 10 0.752 0.593 0.531 0.0213 0.0357 0.044
97 0 0.755 0.599 0.526 0.0217 0.0361 0.0452
97 0.01 0.755 0.599 0.526 0.0218 0.0363 0.0455
97 0.1 0.755 0.599 0.526 0.0218 0.0363 0.0455
97 1 0.755 0.599 0.525 0.0219 0.0363 0.0457
97 10 0.752 0.594 0.53 0.0217 0.0363 0.0444
102 0 0.754 0.598 0.527 0.0226 0.0377 0.0469
102 0.01 0.754 0.598 0.527 0.0224 0.0374 0.0467
102 0.1 0.754 0.598 0.527 0.0223 0.0373 0.0472
102 1 0.755 0.599 0.527 0.0224 0.0373 0.0475
102 10 0.753 0.595 0.53 0.0222 0.0371 0.0458
107 0 0.755 0.6 0.526 0.0232 0.0387 0.0497
107 0.01 0.755 0.6 0.526 0.0233 0.0389 0.0497
107 0.1 0.755 0.6 0.527 0.023 0.0383 0.0493
107 1 0.755 0.6 0.527 0.0225 0.0376 0.0479
107 10 0.753 0.597 0.53 0.0227 0.0378 0.0472
112 0 0.756 0.602 0.523 0.0232 0.0389 0.0495
112 0.01 0.756 0.602 0.523 0.0232 0.0388 0.0493
112 0.1 0.756 0.602 0.523 0.0232 0.0387 0.0501
112 1 0.756 0.601 0.524 0.0234 0.0391 0.0503
112 10 0.754 0.597 0.53 0.023 0.0384 0.0494
Cost was used to select the optimal model using the smallest value.
The final values used for the model were NumVars = 112 and lambda = 0.
>
> set.seed(857)
> nnetGrid <- expand.grid(decay = c(0, 0.001, 0.01, .1, .5),
+ size = (1:10)*2 - 1)
> nnetFit <- train(modForm,
+ data = trainData,
+ method = "nnet",
+ metric = "Cost",
+ maximize = FALSE,
+ tuneGrid = nnetGrid,
+ trace = FALSE,
+ MaxNWts = 2000,
+ maxit = 1000,
+ preProc = c("center", "scale"),
+ trControl = ctrl)
Loading required package: nnet
> nnetFit
Neural Network
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
decay size Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
0 1 0.683 0.463 0.86 0.0295 0.0512 0.164
0 3 0.743 0.577 0.607 0.027 0.045 0.0789
0 5 0.757 0.605 0.524 0.0215 0.0354 0.0697
0 7 0.766 0.62 0.499 0.02 0.0324 0.0622
0 9 0.769 0.627 0.466 0.0216 0.0354 0.0547
0 11 0.774 0.635 0.452 0.0217 0.0351 0.0498
0 13 0.774 0.636 0.454 0.0202 0.0327 0.0561
0 15 0.768 0.626 0.455 0.0216 0.0345 0.0487
0 17 0.773 0.637 0.436 0.0209 0.0326 0.0459
0 19 0.772 0.633 0.437 0.019 0.0298 0.0391
0.001 1 0.694 0.486 0.769 0.0234 0.0403 0.104
0.001 3 0.749 0.588 0.591 0.0241 0.0394 0.066
0.001 5 0.766 0.619 0.513 0.02 0.0332 0.0617
0.001 7 0.778 0.64 0.485 0.0228 0.0377 0.067
0.001 9 0.782 0.647 0.452 0.0217 0.0357 0.0552
0.001 11 0.779 0.643 0.445 0.0211 0.034 0.0493
0.001 13 0.779 0.644 0.434 0.0216 0.0359 0.0592
0.001 15 0.779 0.644 0.432 0.0197 0.0313 0.0499
0.001 17 0.78 0.648 0.419 0.0212 0.0345 0.0457
0.001 19 0.777 0.643 0.417 0.0263 0.0416 0.061
0.01 1 0.694 0.488 0.74 0.022 0.0376 0.0522
0.01 3 0.756 0.601 0.585 0.0203 0.0336 0.0629
0.01 5 0.769 0.622 0.528 0.0238 0.0391 0.0735
0.01 7 0.778 0.64 0.475 0.0179 0.03 0.0513
0.01 9 0.782 0.648 0.448 0.021 0.0335 0.0482
0.01 11 0.785 0.653 0.437 0.0226 0.0367 0.0512
0.01 13 0.784 0.652 0.438 0.0204 0.0329 0.0501
0.01 15 0.784 0.652 0.428 0.0197 0.0318 0.0465
0.01 17 0.782 0.65 0.419 0.0184 0.0292 0.0441
0.01 19 0.787 0.658 0.412 0.0201 0.0318 0.0477
0.1 1 0.693 0.485 0.765 0.0202 0.0342 0.048
0.1 3 0.759 0.604 0.588 0.021 0.0351 0.0566
0.1 5 0.778 0.637 0.502 0.0233 0.0382 0.0622
0.1 7 0.784 0.649 0.474 0.0229 0.0375 0.06
0.1 9 0.794 0.665 0.434 0.0175 0.0283 0.0435
0.1 11 0.791 0.662 0.436 0.0228 0.0369 0.0553
0.1 13 0.793 0.665 0.425 0.0196 0.0322 0.0519
0.1 15 0.794 0.667 0.421 0.0228 0.0369 0.0552
0.1 17 0.796 0.671 0.407 0.0226 0.0362 0.0472
0.1 19 0.799 0.676 0.398 0.0214 0.034 0.0437
0.5 1 0.707 0.5 0.848 0.0199 0.0351 0.0551
0.5 3 0.756 0.598 0.606 0.0182 0.0304 0.0572
0.5 5 0.776 0.634 0.524 0.0196 0.0327 0.0518
0.5 7 0.785 0.649 0.499 0.0185 0.0301 0.0514
0.5 9 0.788 0.655 0.471 0.0177 0.0294 0.053
0.5 11 0.793 0.664 0.449 0.0195 0.0324 0.047
0.5 13 0.793 0.663 0.448 0.022 0.0357 0.0509
0.5 15 0.796 0.668 0.429 0.0201 0.0325 0.0434
0.5 17 0.795 0.668 0.435 0.0227 0.0375 0.0527
0.5 19 0.801 0.677 0.422 0.02 0.0326 0.0492
Cost was used to select the optimal model using the smallest value.
The final values used for the model were size = 19 and decay = 0.1.
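The `MaxNWts = 2000` argument in the call above raises nnet's default cap on the number of network weights. For a single-hidden-layer network with p inputs (after model-matrix expansion of `modForm` — 7 is an assumption here; the expanded design matrix may be wider), s hidden units, and k = 4 output classes, the count is (p + 1)·s + (s + 1)·k, so even the largest grid point fits easily:

```python
def nnet_weight_count(p, s, k):
    """Weights in a single-hidden-layer network: input->hidden plus
    hidden->output, each layer carrying one bias per unit."""
    return (p + 1) * s + (s + 1) * k

# Largest grid point: size = 19 hidden units, 4 classes, assuming 7 inputs
print(nnet_weight_count(7, 19, 4))  # 232, comfortably below MaxNWts = 2000
```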
>
> set.seed(857)
> plsFit <- train(x = expandedTrain,
+ y = trainData$Class,
+ method = "pls",
+ metric = "Cost",
+ maximize = FALSE,
+ tuneLength = 100,
+ preProc = c("center", "scale"),
+ trControl = ctrl)
Loading required package: pls
Attaching package: ‘pls’
The following object is masked from ‘package:caret’:
R2
The following object is masked from ‘package:stats’:
loadings
> plsFit
Partial Least Squares
3467 samples
112 predictors
4 classes: 'VF', 'F', 'M', 'L'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
ncomp Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
1 0.645 0.352 0.998 0.0172 0.0342 0.0282
2 0.638 0.342 1.03 0.016 0.031 0.0264
3 0.646 0.357 1.02 0.0158 0.0311 0.0244
4 0.649 0.369 0.974 0.0162 0.0316 0.0408
5 0.662 0.4 0.921 0.0169 0.0319 0.0365
6 0.676 0.43 0.878 0.0195 0.0359 0.0485
7 0.677 0.434 0.853 0.0197 0.0363 0.0499
8 0.682 0.445 0.828 0.0203 0.0376 0.0532
9 0.689 0.457 0.796 0.0194 0.0358 0.0483
10 0.691 0.463 0.788 0.0194 0.0361 0.0515
11 0.692 0.467 0.776 0.0202 0.037 0.046
12 0.698 0.479 0.768 0.0196 0.0356 0.0496
13 0.7 0.484 0.761 0.0196 0.0352 0.0487
14 0.701 0.485 0.768 0.0196 0.0347 0.0493
15 0.701 0.486 0.766 0.0201 0.0362 0.051
16 0.704 0.492 0.761 0.0208 0.037 0.0504
17 0.707 0.497 0.761 0.0209 0.0376 0.0496
18 0.706 0.496 0.759 0.0194 0.0347 0.0527
19 0.707 0.498 0.756 0.0212 0.0376 0.0543
20 0.71 0.503 0.75 0.0186 0.0332 0.0486
21 0.716 0.514 0.74 0.0196 0.0347 0.052
22 0.719 0.519 0.734 0.0193 0.0344 0.0512
23 0.729 0.537 0.725 0.0184 0.0324 0.0485
24 0.726 0.533 0.731 0.0202 0.0355 0.0512
25 0.727 0.536 0.712 0.0198 0.0349 0.0489
26 0.727 0.536 0.711 0.0218 0.0381 0.0495
27 0.728 0.539 0.708 0.0205 0.0363 0.0495
28 0.728 0.539 0.703 0.0205 0.0361 0.0525
29 0.728 0.54 0.704 0.021 0.037 0.0514
30 0.73 0.543 0.698 0.0215 0.0378 0.0515
31 0.731 0.546 0.695 0.0213 0.0373 0.0499
32 0.732 0.547 0.693 0.0225 0.0393 0.0497
33 0.734 0.551 0.688 0.0216 0.0378 0.0487
34 0.736 0.553 0.684 0.0216 0.0377 0.0497
35 0.737 0.556 0.683 0.0198 0.0348 0.0464
36 0.739 0.559 0.677 0.0202 0.0353 0.0469
37 0.74 0.56 0.675 0.0217 0.0378 0.0503
38 0.74 0.561 0.673 0.0199 0.0345 0.049
39 0.742 0.564 0.669 0.0203 0.0354 0.0509
40 0.741 0.563 0.67 0.019 0.0333 0.0491
41 0.742 0.564 0.667 0.0196 0.034 0.0492
42 0.742 0.564 0.666 0.0197 0.0342 0.0509
43 0.742 0.565 0.662 0.0203 0.0352 0.0507
44 0.743 0.567 0.661 0.0202 0.0349 0.0499
45 0.743 0.567 0.658 0.0203 0.0354 0.0501
46 0.743 0.568 0.657 0.0205 0.0356 0.0503
47 0.743 0.568 0.655 0.0203 0.0352 0.0494
48 0.745 0.571 0.65 0.02 0.0347 0.0497
49 0.744 0.57 0.652 0.0201 0.0349 0.0507
50 0.745 0.571 0.65 0.0199 0.0344 0.0491
51 0.744 0.569 0.652 0.0197 0.0339 0.0495
52 0.744 0.57 0.65 0.0197 0.0341 0.0494
53 0.745 0.571 0.649 0.0207 0.0357 0.0512
54 0.745 0.572 0.648 0.0204 0.0351 0.0499
55 0.745 0.572 0.648 0.0203 0.0349 0.0507
56 0.745 0.572 0.647 0.0196 0.0337 0.051
57 0.746 0.573 0.644 0.0194 0.0332 0.0481
58 0.745 0.572 0.646 0.0191 0.0328 0.0487
59 0.745 0.573 0.645 0.0197 0.034 0.05
60 0.746 0.573 0.644 0.0198 0.0342 0.0504
61 0.746 0.574 0.642 0.0194 0.0335 0.0495
62 0.746 0.574 0.641 0.0201 0.0347 0.0499
63 0.746 0.574 0.641 0.0206 0.0355 0.0505
64 0.747 0.575 0.641 0.0201 0.0347 0.05
65 0.747 0.575 0.64 0.0206 0.0354 0.0491
66 0.747 0.576 0.638 0.02 0.0345 0.0492
67 0.747 0.576 0.639 0.0203 0.0349 0.0488
68 0.747 0.576 0.639 0.0202 0.0347 0.0487
69 0.747 0.575 0.64 0.0204 0.0351 0.0502
70 0.747 0.576 0.639 0.0198 0.034 0.0491
71 0.747 0.576 0.638 0.0201 0.0345 0.0486
72 0.748 0.577 0.636 0.0201 0.0346 0.05
73 0.748 0.577 0.637 0.0201 0.0345 0.0496
74 0.748 0.577 0.637 0.0205 0.0354 0.0516
75 0.747 0.576 0.638 0.0207 0.0357 0.0523
76 0.747 0.576 0.639 0.0205 0.0353 0.0511
77 0.747 0.576 0.639 0.0201 0.0346 0.0501
78 0.747 0.576 0.639 0.02 0.0345 0.0506
79 0.747 0.575 0.639 0.0198 0.0341 0.0491
80 0.747 0.575 0.64 0.0197 0.034 0.0495
81 0.747 0.575 0.641 0.02 0.0344 0.0494
82 0.747 0.575 0.641 0.0203 0.035 0.0498
83 0.747 0.575 0.641 0.0201 0.0347 0.0494
84 0.747 0.575 0.641 0.0203 0.0349 0.0496
85 0.747 0.575 0.641 0.0203 0.035 0.0497
86 0.747 0.575 0.641 0.0198 0.0341 0.0494
87 0.747 0.575 0.641 0.0201 0.0346 0.0499
88 0.747 0.575 0.641 0.0202 0.0348 0.0499
89 0.747 0.575 0.641 0.0203 0.0349 0.0498
90 0.747 0.575 0.641 0.0203 0.035 0.0499
91 0.747 0.575 0.64 0.0204 0.0351 0.0501
92 0.747 0.575 0.641 0.0204 0.035 0.0498
93 0.747 0.575 0.641 0.0205 0.0353 0.0499
94 0.747 0.575 0.641 0.0206 0.0353 0.0499
95 0.747 0.575 0.641 0.0206 0.0354 0.0499
96 0.747 0.575 0.641 0.0205 0.0352 0.0498
97 0.747 0.575 0.641 0.0205 0.0352 0.0498
98 0.747 0.575 0.641 0.0205 0.0352 0.0498
99 0.747 0.575 0.641 0.0205 0.0352 0.0498
100 0.747 0.575 0.641 0.0206 0.0353 0.0499
Cost was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 72.
>
> set.seed(857)
> fdaFit <- train(modForm, data = trainData,
+ method = "fda",
+ metric = "Cost",
+ maximize = FALSE,
+ tuneLength = 25,
+ trControl = ctrl)
Loading required package: earth
Loading required package: plotmo
Loading required package: plotrix
> fdaFit
Flexible Discriminant Analysis
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
nprune Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
2 0.524 0.0711 0.929 0.00646 0.021 0.0455
3 0.541 0.142 0.843 0.00898 0.0221 0.0368
4 0.61 0.298 0.79 0.0143 0.03 0.0412
5 0.659 0.405 0.753 0.0156 0.03 0.042
6 0.678 0.451 0.75 0.018 0.0324 0.0468
7 0.684 0.466 0.699 0.0174 0.0305 0.0513
8 0.693 0.487 0.64 0.0206 0.0359 0.0522
9 0.695 0.491 0.634 0.0214 0.0369 0.0549
10 0.698 0.496 0.631 0.021 0.0363 0.0551
11 0.71 0.518 0.62 0.0224 0.0382 0.0575
12 0.713 0.524 0.617 0.0204 0.0351 0.054
13 0.715 0.529 0.612 0.0229 0.0388 0.0584
14 0.724 0.544 0.602 0.0222 0.0375 0.0593
15 0.726 0.547 0.602 0.019 0.0328 0.0567
16 0.727 0.548 0.602 0.0202 0.0344 0.0559
17 0.725 0.545 0.608 0.019 0.033 0.0571
18 0.726 0.547 0.606 0.0205 0.0352 0.0588
19 0.727 0.548 0.607 0.0206 0.0348 0.0598
20 0.727 0.549 0.606 0.0208 0.0353 0.0596
21 0.729 0.552 0.602 0.0213 0.0358 0.0572
22 0.731 0.555 0.6 0.0213 0.0361 0.0583
23 0.732 0.557 0.598 0.0202 0.0343 0.0562
Tuning parameter 'degree' was held constant at a value of 1
Cost was used to select the optimal model using the smallest value.
The final values used for the model were degree = 1 and nprune = 23.
>
> set.seed(857)
> rfFit <- train(x = trainData[, predictors],
+ y = trainData$Class,
+ method = "rf",
+ metric = "Cost",
+ maximize = FALSE,
+ tuneLength = 10,
+ ntree = 2000,
+ importance = TRUE,
+ trControl = ctrl)
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:Hmisc’:
combine
note: only 6 unique complexity parameters in default grid. Truncating the grid to 6.
> rfFit
Random Forest
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
2 0.842 0.743 0.336 0.0168 0.0275 0.042
3 0.845 0.748 0.328 0.0176 0.0289 0.0419
4 0.845 0.748 0.326 0.0173 0.0282 0.0434
5 0.843 0.746 0.328 0.0166 0.0272 0.0443
6 0.843 0.745 0.328 0.0172 0.0282 0.0462
7 0.842 0.744 0.328 0.0171 0.0279 0.0437
Cost was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 4.
>
> set.seed(857)
> rfFitCost <- train(x = trainData[, predictors],
+ y = trainData$Class,
+ method = "rf",
+ metric = "Cost",
+ maximize = FALSE,
+ tuneLength = 10,
+ ntree = 2000,
+ classwt = c(VF = 1, F = 1, M = 5, L = 10),
+ importance = TRUE,
+ trControl = ctrl)
note: only 6 unique complexity parameters in default grid. Truncating the grid to 6.
> rfFitCost
Random Forest
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
2 0.84 0.739 0.34 0.0171 0.0281 0.0452
3 0.843 0.745 0.345 0.0159 0.0259 0.0413
4 0.844 0.746 0.345 0.016 0.0263 0.0439
5 0.844 0.747 0.341 0.0182 0.0298 0.0459
6 0.846 0.75 0.337 0.0168 0.0275 0.0432
7 0.845 0.748 0.337 0.0169 0.0274 0.0416
Cost was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 7.
>
> c5Grid <- expand.grid(trials = c(1, (1:10)*10),
+ model = "tree",
+ winnow = c(TRUE, FALSE))
> set.seed(857)
> c50Fit <- train(x = trainData[, predictors],
+ y = trainData$Class,
+ method = "C5.0",
+ metric = "Cost",
+ maximize = FALSE,
+ tuneGrid = c5Grid,
+ trControl = ctrl)
Loading required package: C50
Loading required package: plyr
Attaching package: ‘plyr’
The following objects are masked from ‘package:Hmisc’:
is.discrete, summarize
> c50Fit
C5.0
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
winnow trials Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
FALSE 1 0.801 0.677 0.396 0.0183 0.0305 0.0476
FALSE 10 0.833 0.729 0.338 0.0185 0.0308 0.047
FALSE 20 0.836 0.735 0.326 0.0177 0.0291 0.0471
FALSE 30 0.839 0.739 0.324 0.0175 0.0288 0.0439
FALSE 40 0.839 0.739 0.324 0.0177 0.0289 0.0433
FALSE 50 0.839 0.739 0.322 0.0175 0.0286 0.0451
FALSE 60 0.839 0.74 0.322 0.0185 0.0303 0.0444
FALSE 70 0.84 0.741 0.32 0.0165 0.0271 0.0432
FALSE 80 0.84 0.741 0.319 0.0171 0.0281 0.0431
FALSE 90 0.841 0.743 0.318 0.0163 0.027 0.044
FALSE 100 0.841 0.742 0.32 0.016 0.0263 0.0432
TRUE 1 0.801 0.678 0.397 0.018 0.0299 0.0463
TRUE 10 0.832 0.727 0.34 0.0182 0.0302 0.0484
TRUE 20 0.834 0.732 0.327 0.0176 0.0288 0.048
TRUE 30 0.837 0.737 0.323 0.0168 0.0276 0.0456
TRUE 40 0.838 0.737 0.323 0.0167 0.0272 0.0443
TRUE 50 0.838 0.737 0.32 0.0164 0.0267 0.0451
TRUE 60 0.839 0.739 0.32 0.017 0.0276 0.0436
TRUE 70 0.839 0.739 0.319 0.0158 0.0258 0.0436
TRUE 80 0.839 0.74 0.318 0.0161 0.0264 0.0438
TRUE 90 0.84 0.741 0.317 0.0161 0.0265 0.0453
TRUE 100 0.841 0.742 0.317 0.0158 0.0259 0.0451
Tuning parameter 'model' was held constant at a value of tree
Cost was used to select the optimal model using the smallest value.
The final values used for the model were trials = 90, model = tree and winnow = TRUE.
>
> set.seed(857)
> c50Cost <- train(x = trainData[, predictors],
+ y = trainData$Class,
+ method = "C5.0",
+ metric = "Cost",
+ maximize = FALSE,
+ costs = costMatrix,
+ tuneGrid = c5Grid,
+ trControl = ctrl)
> c50Cost
C5.0
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
winnow trials Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
FALSE 1 0.796 0.667 0.462 0.0185 0.0312 0.0526
FALSE 10 0.829 0.723 0.346 0.0188 0.0311 0.047
FALSE 20 0.834 0.731 0.33 0.0204 0.0337 0.0506
FALSE 30 0.835 0.733 0.325 0.0192 0.0318 0.048
FALSE 40 0.835 0.733 0.322 0.018 0.0297 0.0433
FALSE 50 0.836 0.735 0.318 0.0192 0.0316 0.0442
FALSE 60 0.836 0.734 0.318 0.0186 0.0307 0.045
FALSE 70 0.837 0.736 0.315 0.0181 0.0299 0.0454
FALSE 80 0.837 0.737 0.314 0.0189 0.031 0.0461
FALSE 90 0.839 0.739 0.314 0.0178 0.0293 0.0462
FALSE 100 0.839 0.74 0.317 0.0183 0.0302 0.0483
TRUE 1 0.773 0.624 0.554 0.0368 0.0694 0.128
TRUE 10 0.793 0.658 0.461 0.0511 0.094 0.174
TRUE 20 0.796 0.663 0.449 0.0524 0.0963 0.179
TRUE 30 0.797 0.664 0.446 0.0529 0.097 0.181
TRUE 40 0.796 0.664 0.446 0.0527 0.0967 0.181
TRUE 50 0.796 0.663 0.445 0.0525 0.0964 0.181
TRUE 60 0.796 0.663 0.444 0.0523 0.0962 0.182
TRUE 70 0.796 0.664 0.443 0.0522 0.096 0.182
TRUE 80 0.798 0.666 0.441 0.0533 0.0977 0.184
TRUE 90 0.799 0.668 0.441 0.0542 0.0991 0.184
TRUE 100 0.799 0.668 0.442 0.0542 0.0991 0.183
Tuning parameter 'model' was held constant at a value of tree
Cost was used to select the optimal model using the smallest value.
The final values used for the model were trials = 90, model = tree and winnow = FALSE.
>
> set.seed(857)
> bagFit <- train(x = trainData[, predictors],
+ y = trainData$Class,
+ method = "treebag",
+ metric = "Cost",
+ maximize = FALSE,
+ nbagg = 50,
+ trControl = ctrl)
Loading required package: ipred
Loading required package: prodlim
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> bagFit
Bagged CART
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results
Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
0.836 0.735 0.333 0.0155 0.0249 0.0417
>
> ### Use the caret bag() function to bag the cost-sensitive CART model
> rpCost <- function(x, y)
+ {
+ costMatrix <- ifelse(diag(4) == 1, 0, 1)
+ costMatrix[4, 1] <- 10
+ costMatrix[3, 1] <- 5
+ costMatrix[4, 2] <- 5
+ costMatrix[3, 2] <- 5
+ library(rpart)
+ tmp <- x
+ tmp$y <- y
+ rpart(y~., data = tmp, control = rpart.control(cp = 0),
+ parms =list(loss = costMatrix))
+ }
> rpPredict <- function(object, x) predict(object, x)
>
> rpAgg <- function (x, type = "class")
+ {
+ pooled <- x[[1]] * NA
+ n <- nrow(pooled)
+ classes <- colnames(pooled)
+ for (i in 1:ncol(pooled))
+ {
+ tmp <- lapply(x, function(y, col) y[, col], col = i)
+ tmp <- do.call("rbind", tmp)
+ pooled[, i] <- apply(tmp, 2, median)
+ }
+ pooled <- apply(pooled, 1, function(x) x/sum(x))
+ if (n != nrow(pooled)) pooled <- t(pooled)
+ out <- factor(classes[apply(pooled, 1, which.max)], levels = classes)
+ out
+ }
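The `rpAgg` function above pools the B bagged models' class-probability matrices by taking the per-sample median for each class, renormalizing each row to sum to one, and predicting the class with the largest pooled probability. The same pooling, sketched in Python with NumPy arrays standing in for the list of probability matrices:

```python
import numpy as np

def rp_agg(prob_list, classes):
    """Median-pool class probabilities across bagged models, renormalize rows,
    and return the argmax class per sample (mirrors rpAgg above)."""
    stacked = np.stack(prob_list)              # shape (B, n_samples, n_classes)
    pooled = np.median(stacked, axis=0)        # per-sample, per-class median
    pooled = pooled / pooled.sum(axis=1, keepdims=True)
    return [classes[i] for i in pooled.argmax(axis=1)]
```

Median pooling makes the ensemble vote robust to a few bagged trees that assign extreme probabilities, at the price of pooled rows that no longer sum to one — hence the renormalization step before the argmax.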
>
>
> set.seed(857)
> rpCostBag <- train(trainData[, predictors],
+ trainData$Class,
+ "bag",
+ B = 50,
+ bagControl = bagControl(fit = rpCost,
+ predict = rpPredict,
+ aggregate = rpAgg,
+ downSample = FALSE,
+ allowParallel = FALSE),
+ trControl = ctrl)
> rpCostBag
Bagged Model
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results
Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
0.807 0.689 0.369 0.0163 0.0263 0.0446
Tuning parameter 'vars' was held constant at a value of 7
>
> set.seed(857)
> svmRFit <- train(modForm,
+ data = trainData,
+ method = "svmRadial",
+ metric = "Cost",
+ maximize = FALSE,
+ preProc = c("center", "scale"),
+ tuneLength = 15,
+ trControl = ctrl)
Loading required package: kernlab
> svmRFit
Support Vector Machines with Radial Basis Function Kernel
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
C Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
0.25 0.704 0.486 0.853 0.0168 0.0299 0.0405
0.5 0.744 0.568 0.671 0.0202 0.0352 0.0548
1 0.77 0.618 0.562 0.0193 0.0332 0.0494
2 0.784 0.644 0.522 0.0207 0.0347 0.0476
4 0.791 0.658 0.49 0.0194 0.0322 0.044
8 0.797 0.668 0.456 0.0181 0.0297 0.0391
16 0.799 0.673 0.438 0.0184 0.0299 0.0413
32 0.801 0.677 0.424 0.0183 0.0296 0.0394
64 0.802 0.679 0.415 0.0183 0.0298 0.0446
128 0.802 0.68 0.404 0.0202 0.0331 0.0495
256 0.805 0.684 0.393 0.022 0.0363 0.0522
512 0.807 0.689 0.385 0.021 0.0345 0.0533
1020 0.808 0.69 0.38 0.0212 0.0345 0.0543
2050 0.804 0.684 0.387 0.0218 0.0353 0.0518
4100 0.802 0.679 0.391 0.0199 0.0324 0.0489
Tuning parameter 'sigma' was held constant at a value of 0.03332721
Cost was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.0333 and C = 1024.
>
> set.seed(857)
> svmRFitCost <- train(modForm, data = trainData,
+ method = "svmRadial",
+ metric = "Cost",
+ maximize = FALSE,
+ preProc = c("center", "scale"),
+ class.weights = c(VF = 1, F = 1, M = 5, L = 10),
+ tuneLength = 15,
+ trControl = ctrl)
> svmRFitCost
Support Vector Machines with Radial Basis Function Kernel
3467 samples
7 predictors
4 classes: 'VF', 'F', 'M', 'L'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 3120, 3120, 3120, 3121, 3120, 3120, ...
Resampling results across tuning parameters:
C Accuracy Kappa Cost Accuracy SD Kappa SD Cost SD
0.25 0.681 0.513 0.378 0.0227 0.0333 0.0402
0.5 0.703 0.543 0.365 0.0201 0.0303 0.0354
1 0.726 0.576 0.347 0.0185 0.0278 0.0321
2 0.744 0.602 0.337 0.0179 0.0272 0.0356
4 0.753 0.614 0.339 0.0161 0.0244 0.0304
8 0.762 0.626 0.34 0.0165 0.0258 0.0395
16 0.77 0.637 0.347 0.0182 0.0288 0.0411
32 0.777 0.647 0.346 0.0186 0.0292 0.0446
64 0.783 0.655 0.35 0.0209 0.0331 0.0481
128 0.787 0.661 0.359 0.0223 0.0356 0.0517
256 0.79 0.665 0.36 0.0231 0.0371 0.0515
512 0.791 0.666 0.37 0.0235 0.0379 0.0521
1020 0.794 0.669 0.376 0.0222 0.0358 0.0534
2050 0.795 0.671 0.378 0.0224 0.0363 0.0517
4100 0.793 0.667 0.389 0.0202 0.0325 0.0503
Tuning parameter 'sigma' was held constant at a value of 0.03332721
Cost was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.0333 and C = 2.
>
> modelList <- list(C5.0 = c50Fit,
+ "C5.0 (Costs)" = c50Cost,
+ CART = rpFit,
+ "CART (Costs)" = rpFitCost,
+ "Bagging (Costs)" = rpCostBag,
+ FDA = fdaFit,
+ SVM = svmRFit,
+ "SVM (Weights)" = svmRFitCost,
+ PLS = plsFit,
+ "Random Forests" = rfFit,
+ LDA = ldaFit,
+ "LDA (Sparse)" = sldaFit,
+ "Neural Networks" = nnetFit,
+ Bagging = bagFit)
>
>
> ################################################################################
> ### Section 17.2 Results
>
> rs <- resamples(modelList)
> summary(rs)
Call:
summary.resamples(object = rs)
Models: C5.0, C5.0 (Costs), CART, CART (Costs), Bagging (Costs), FDA, SVM, SVM (Weights), PLS, Random Forests, LDA, LDA (Sparse), Neural Networks, Bagging
Number of resamples: 50
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
C5.0 0.8040 0.8278 0.8427 0.8404 0.8473 0.8736 0
C5.0 (Costs) 0.8069 0.8249 0.8357 0.8387 0.8500 0.8757 0
CART 0.7328 0.7637 0.7723 0.7738 0.7859 0.8242 0
CART (Costs) 0.6888 0.7081 0.7201 0.7199 0.7312 0.7550 0
Bagging (Costs) 0.7637 0.7949 0.8092 0.8065 0.8173 0.8329 0
FDA 0.6686 0.7199 0.7309 0.7315 0.7457 0.7723 0
SVM 0.7579 0.7961 0.8055 0.8076 0.8202 0.8555 0
SVM (Weights) 0.7069 0.7320 0.7435 0.7444 0.7543 0.7896 0
PLS 0.7061 0.7351 0.7460 0.7478 0.7608 0.7960 0
Random Forests 0.8017 0.8324 0.8444 0.8447 0.8559 0.8844 0
LDA 0.7176 0.7389 0.7511 0.7560 0.7752 0.8132 0
LDA (Sparse) 0.7176 0.7389 0.7511 0.7560 0.7752 0.8132 0
Neural Networks 0.7522 0.7844 0.7991 0.7990 0.8143 0.8621 0
Bagging 0.8069 0.8262 0.8372 0.8361 0.8473 0.8671 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
C5.0 0.6792 0.7208 0.7450 0.7414 0.7545 0.7951 0
C5.0 (Costs) 0.6860 0.7158 0.7361 0.7387 0.7600 0.7979 0
CART 0.5655 0.6118 0.6297 0.6314 0.6505 0.7165 0
CART (Costs) 0.5193 0.5465 0.5669 0.5649 0.5825 0.6187 0
Bagging (Costs) 0.6170 0.6694 0.6922 0.6891 0.7067 0.7339 0
FDA 0.4497 0.5381 0.5535 0.5571 0.5819 0.6308 0
SVM 0.6087 0.6739 0.6869 0.6895 0.7095 0.7655 0
SVM (Weights) 0.5428 0.5855 0.5990 0.6017 0.6151 0.6699 0
PLS 0.5080 0.5558 0.5740 0.5768 0.6010 0.6598 0
Random Forests 0.6784 0.7282 0.7477 0.7477 0.7655 0.8107 0
LDA 0.5401 0.5712 0.5931 0.6020 0.6361 0.6968 0
LDA (Sparse) 0.5401 0.5712 0.5931 0.6020 0.6361 0.6968 0
Neural Networks 0.6028 0.6512 0.6761 0.6761 0.6980 0.7765 0
Bagging 0.6830 0.7168 0.7346 0.7346 0.7533 0.7844 0
Cost
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
C5.0 0.2254 0.2919 0.3146 0.3172 0.3357 0.4265 0
C5.0 (Costs) 0.2283 0.2795 0.3112 0.3138 0.3472 0.4195 0
CART 0.3718 0.4761 0.5144 0.5095 0.5465 0.6580 0
CART (Costs) 0.2882 0.3220 0.3425 0.3427 0.3613 0.4236 0
Bagging (Costs) 0.2803 0.3345 0.3646 0.3693 0.3954 0.5130 0
FDA 0.4813 0.5552 0.5908 0.5983 0.6433 0.7118 0
SVM 0.2717 0.3465 0.3790 0.3802 0.4022 0.5260 0
SVM (Weights) 0.2565 0.3134 0.3309 0.3367 0.3598 0.4265 0
PLS 0.5562 0.5937 0.6297 0.6364 0.6712 0.7435 0
Random Forests 0.2543 0.2997 0.3184 0.3259 0.3429 0.4265 0
LDA 0.4150 0.4913 0.5237 0.5229 0.5632 0.6254 0
LDA (Sparse) 0.4150 0.4913 0.5237 0.5229 0.5632 0.6254 0
Neural Networks 0.3055 0.3729 0.3988 0.3981 0.4261 0.5029 0
Bagging 0.2630 0.3057 0.3261 0.3326 0.3581 0.4467 0
>
> confusionMatrix(rpFitCost, "none")
Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
(entries are un-normalized counts)
Reference
Prediction VF F M L
VF 157.5 25.6 1.9 0.2
F 10.0 43.1 3.3 0.2
M 9.4 37.0 34.3 5.7
L 0.1 2.0 1.7 14.7
> confusionMatrix(rfFit, "none")
Cross-Validated (10 fold, repeated 5 times) Confusion Matrix
(entries are un-normalized counts)
Reference
Prediction VF F M L
VF 164.8 17.9 1.3 0.2
F 12.0 83.8 11.6 1.9
M 0.2 5.5 27.3 1.8
L 0.0 0.6 1.0 16.9
>
> plot(bwplot(rs, metric = "Cost"))
>
> rfPred <- predict(rfFit, testData)
> rpPred <- predict(rpFitCost, testData)
>
> confusionMatrix(rfPred, testData$Class)
Confusion Matrix and Statistics
Reference
Prediction VF F M L
VF 414 45 3 0
F 28 206 27 5
M 0 18 71 6
L 0 0 1 40
Overall Statistics
Accuracy : 0.8461
95% CI : (0.8202, 0.8695)
No Information Rate : 0.5116
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7496
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: VF Class: F Class: M Class: L
Sensitivity 0.9367 0.7658 0.69608 0.78431
Specificity 0.8863 0.8992 0.96850 0.99877
Pos Pred Value 0.8961 0.7744 0.74737 0.97561
Neg Pred Value 0.9303 0.8946 0.95969 0.98663
Prevalence 0.5116 0.3113 0.11806 0.05903
Detection Rate 0.4792 0.2384 0.08218 0.04630
Detection Prevalence 0.5347 0.3079 0.10995 0.04745
Balanced Accuracy 0.9115 0.8325 0.83229 0.89154
> confusionMatrix(rpPred, testData$Class)
Confusion Matrix and Statistics
Reference
Prediction VF F M L
VF 383 61 5 1
F 32 106 7 2
M 26 99 87 15
L 1 3 3 33
Overall Statistics
Accuracy : 0.7049
95% CI : (0.6732, 0.7351)
No Information Rate : 0.5116
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5437
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: VF Class: F Class: M Class: L
Sensitivity 0.8665 0.3941 0.8529 0.64706
Specificity 0.8412 0.9311 0.8163 0.99139
Pos Pred Value 0.8511 0.7211 0.3833 0.82500
Neg Pred Value 0.8575 0.7727 0.9765 0.97816
Prevalence 0.5116 0.3113 0.1181 0.05903
Detection Rate 0.4433 0.1227 0.1007 0.03819
Detection Prevalence 0.5208 0.1701 0.2627 0.04630
Balanced Accuracy 0.8539 0.6626 0.8346 0.81922
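The test-set cost can be recovered directly from a confusion matrix like the ones above: weight each cell by the loss matrix built in `rpCost` (rows = true class, columns = predicted class, in rpart's `parms$loss` convention) and divide by the number of samples. Note that `confusionMatrix` prints rows as predictions and columns as the reference, so the table must be transposed first. A sketch using the random-forest test-set counts above:

```python
import numpy as np

# Loss matrix from rpCost: rows = true class (VF, F, M, L), cols = predicted
COST = np.ones((4, 4)) - np.eye(4)
COST[3, 0] = 10   # true L predicted VF
COST[2, 0] = 5    # true M predicted VF
COST[3, 1] = 5    # true L predicted F
COST[2, 1] = 5    # true M predicted F

# confusionMatrix(rfPred, testData$Class): rows = Prediction, cols = Reference
pred_by_true = np.array([[414,  45,  3,  0],
                         [ 28, 206, 27,  5],
                         [  0,  18, 71,  6],
                         [  0,   0,  1, 40]])
counts = pred_by_true.T                       # flip to rows = true class
avg_cost = (counts * COST).sum() / counts.sum()
print(round(avg_cost, 3))                     # → 0.316
```

Whether this number matches the "Cost" that caret reports depends on the summary function defined earlier in the chapter script, but it shows why the cost-weighted models trade a little accuracy for far fewer expensive errors (true L or M jobs predicted as VF).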
>
>
> ################################################################################
> ### Session Information
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel grid tools splines stats graphics grDevices
[8] utils datasets methods base
other attached packages:
[1] kernlab_0.9-18 ipred_0.9-1
[3] prodlim_1.3.7 plyr_1.8
[5] C50_0.1.0-15 randomForest_4.6-7
[7] earth_3.2-6 plotrix_3.4-7
[9] plotmo_1.3-2 pls_2.3-0
[11] nnet_7.3-6 sparseLDA_0.1-6
[13] mda_0.4-2 elasticnet_1.1
[15] lars_1.2 MASS_7.3-26
[17] e1071_1.6-1 class_7.3-7
[19] rpart_4.1-1 doMC_1.3.0
[21] iterators_1.0.6 foreach_1.4.0
[23] caret_6.0-22 ggplot2_0.9.3.1
[25] lattice_0.20-15 tabplot_1.0
[27] ffbase_0.8 ff_2.2-11
[29] bit_1.1-10 Hmisc_3.10-1.1
[31] survival_2.37-4 AppliedPredictiveModeling_1.1-5
loaded via a namespace (and not attached):
[1] car_2.0-17 cluster_1.14.4 codetools_0.2-8 colorspace_1.2-2
[5] compiler_3.0.1 CORElearn_0.9.41 dichromat_2.0-0 digest_0.6.3
[9] gtable_0.1.2 KernSmooth_2.23-10 labeling_0.1 munsell_0.4
[13] proto_0.3-10 RColorBrewer_1.0-5 reshape2_1.2.2 scales_0.2.3
[17] stringr_0.6.2
>
> q("no")
> proc.time()
user system elapsed
492217.97 31824.96 39801.06
%%R -w 600 -h 600
## runChapterScript(17)
## user system elapsed
## 492217.97 31824.96 39801.06
NULL
%%R
showChapterScript(18)
NULL
%%R
showChapterOutput(18)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 18: Measuring Predictor Importance
> ###
> ### Required packages: AppliedPredictiveModeling, caret, CORElearn, corrplot,
> ### pROC, minerva
> ###
> ###
> ### Data used: The solubility data from the AppliedPredictiveModeling
> ### package, the segmentation data in the caret package and the
> ### grant data (created using "CreateGrantData.R" in the same
> ### directory as this file).
> ###
> ### Notes:
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing sections. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be
> ### syntax differences that occur over time as packages evolve. These files
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
>
>
>
> ################################################################################
> ### Section 18.1 Numeric Outcomes
>
> ## Load the solubility data
>
> library(AppliedPredictiveModeling)
> data(solubility)
>
> trainData <- solTrainXtrans
> trainData$y <- solTrainY
>
>
> ## keep the continuous predictors and append the outcome to the data frame
> SolContPred <- solTrainXtrans[, !grepl("FP", names(solTrainXtrans))]
> numSolPred <- ncol(SolContPred)
> SolContPred$Sol <- solTrainY
>
> ## Get the LOESS smoother and the summary measure
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> smoother <- filterVarImp(x = SolContPred[, -ncol(SolContPred)],
+ y = solTrainY,
+ nonpara = TRUE)
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following objects are masked from ‘package:stats’:
cov, smooth, var
> smoother$Predictor <- rownames(smoother)
> names(smoother)[1] <- "Smoother"
>
> ## Calculate the correlation matrices and keep the columns with the correlations
> ## between the predictors and the outcome
>
> correlations <- cor(SolContPred)[-(numSolPred+1),(numSolPred+1)]
> rankCorrelations <- cor(SolContPred, method = "spearman")[-(numSolPred+1),(numSolPred+1)]
> corrs <- data.frame(Predictor = names(SolContPred)[1:numSolPred],
+ Correlation = correlations,
+ RankCorrelation = rankCorrelations)
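Spearman's rank correlation is just Pearson's correlation applied to ranks, which is what makes it robust to monotone nonlinearity. A minimal pure-Python sketch (toy vectors, not the solubility predictors):

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation, as cor() computes by default."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def ranks(v):
    """1-based ranks, averaging over ties (as method = "spearman" does)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```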
>
> ## The maximal information coefficient (MIC) values can be obtained from the
> ### minerva package:
>
> library(minerva)
> MIC <- mine(x = SolContPred[, 1:numSolPred], y = solTrainY)$MIC
> MIC <- data.frame(Predictor = rownames(MIC),
+ MIC = MIC[,1])
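MIC is built on mutual information computed over an optimized grid; minerva's mine() searches many grid resolutions and normalizes the best score. The sketch below shows only the innermost idea — mutual information on one fixed equal-width grid — on toy data, with `bins=4` an arbitrary choice:

```python
from collections import Counter
from math import log2
import random

def binned_mi(x, y, bins=4):
    """Mutual information of x and y after equal-width binning -- a crude,
    fixed-grid stand-in for the grid search that mine() performs."""
    def idx(v, lo, hi):
        return min(int((v - lo) / (hi - lo) * bins), bins - 1) if hi > lo else 0
    bx = [idx(v, min(x), max(x)) for v in x]
    by = [idx(v, min(y), max(y)) for v in y]
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * log2((c / n) / (px[a] / n * py[b] / n))
               for (a, b), c in pxy.items())

x = [i / 10 for i in range(40)]   # a perfectly dependent pair maximizes MI
ys = x[:]
random.Random(7).shuffle(ys)      # destroying the relationship lowers it
print(binned_mi(x, x) > binned_mi(x, ys))
```

Like MIC, this kind of score is high for any functional relationship, linear or not, and drops toward zero when the pairing is scrambled.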
>
>
> ## The Relief values for regression can be computed using the CORElearn
> ## package:
>
> library(CORElearn)
Loading required package: cluster
Loading required package: rpart
> ReliefF <- attrEval(Sol ~ ., data = SolContPred,
+ estimator = "RReliefFequalK")
> ReliefF <- data.frame(Predictor = names(ReliefF),
+ Relief = ReliefF)
>
> ## Combine them all together for a plot
> contDescrScores <- merge(smoother, corrs)
> contDescrScores <- merge(contDescrScores, MIC)
> contDescrScores <- merge(contDescrScores, ReliefF)
>
> rownames(contDescrScores) <- contDescrScores$Predictor
>
> contDescrScores
Predictor Smoother Correlation RankCorrelation
HydrophilicFactor HydrophilicFactor 0.184455208 0.38598321 0.36469127
MolWeight MolWeight 0.444393085 -0.65852844 -0.68529880
NumAromaticBonds NumAromaticBonds 0.168645461 -0.41066466 -0.45787109
NumAtoms NumAtoms 0.189931478 -0.43581129 -0.51983173
NumBonds NumBonds 0.210717251 -0.45903949 -0.54839850
NumCarbon NumCarbon 0.368196173 -0.60679170 -0.67359114
NumChlorine NumChlorine 0.158529031 -0.39815704 -0.35707519
NumDblBonds NumDblBonds 0.002409996 0.04909171 -0.02042731
NumHalogen NumHalogen 0.157187646 -0.39646897 -0.38111965
NumHydrogen NumHydrogen 0.022654223 -0.15051320 -0.25592586
NumMultBonds NumMultBonds 0.230799468 -0.48041593 -0.47971353
NumNitrogen NumNitrogen 0.026032871 0.16134705 0.10078218
NumNonHAtoms NumNonHAtoms 0.340616555 -0.58362364 -0.62965400
NumNonHBonds NumNonHBonds 0.342455243 -0.58519676 -0.63228366
NumOxygen NumOxygen 0.045245139 0.21270905 0.14954994
NumRings NumRings 0.231183499 -0.48081545 -0.50941815
NumRotBonds NumRotBonds 0.013147325 -0.11466178 -0.14976036
NumSulfer NumSulfer 0.005865198 -0.07658458 -0.12090249
SurfaceArea1 SurfaceArea1 0.192535120 0.30325216 0.19339720
SurfaceArea2 SurfaceArea2 0.216936613 0.26663995 0.14057885
MIC Relief
HydrophilicFactor 0.3208456 0.140185965
MolWeight 0.4679277 0.084734907
NumAromaticBonds 0.2705170 0.050013692
NumAtoms 0.2896815 0.008618179
NumBonds 0.3268683 0.002422405
NumCarbon 0.4434121 0.061605610
NumChlorine 0.2011708 0.023813283
NumDblBonds 0.1688472 0.056997492
NumHalogen 0.2017841 0.045002621
NumHydrogen 0.1939521 0.075626122
NumMultBonds 0.2792600 0.051554380
NumNitrogen 0.1535738 0.168280773
NumNonHAtoms 0.3947092 0.036433860
NumNonHBonds 0.3919627 0.035619406
NumOxygen 0.1527421 0.123797003
NumRings 0.3161828 0.056263469
NumRotBonds 0.1754215 0.043556286
NumSulfer 0.1297052 0.062359034
SurfaceArea1 0.2054896 0.120727945
SurfaceArea2 0.2274047 0.117632188
>
> contDescrSplomData <- contDescrScores
> contDescrSplomData$Correlation <- abs(contDescrSplomData$Correlation)
> contDescrSplomData$RankCorrelation <- abs(contDescrSplomData$RankCorrelation)
> contDescrSplomData$Group <- "Other"
> contDescrSplomData$Group[grepl("Surface", contDescrSplomData$Predictor)] <- "SA"
>
> featurePlot(solTrainXtrans[, c("NumCarbon", "SurfaceArea2")],
+ solTrainY,
+ between = list(x = 1),
+ type = c("g", "p", "smooth"),
+ df = 3,
+ aspect = 1,
+ labels = c("", "Solubility"))
>
>
> splom(~contDescrSplomData[,c(3, 4, 2, 5)],
+ groups = contDescrSplomData$Group,
+ varnames = c("Correlation", "Rank\nCorrelation", "LOESS", "MIC"))
>
>
> ## Now look at the categorical (i.e. binary) predictors
> SolCatPred <- solTrainXtrans[, grepl("FP", names(solTrainXtrans))]
> SolCatPred$Sol <- solTrainY
> numSolCatPred <- ncol(SolCatPred) - 1
>
> tests <- apply(SolCatPred[, 1:numSolCatPred], 2,
+ function(x, y)
+ {
+ tStats <- t.test(y ~ x)[c("statistic", "p.value", "estimate")]
+ unlist(tStats)
+ },
+ y = solTrainY)
> ## The results are a matrix with predictors in columns. We transpose it so
> tests <- as.data.frame(t(tests))
> names(tests) <- c("t.Statistic", "t.test_p.value", "mean0", "mean1")
> tests$difference <- tests$mean1 - tests$mean0
> tests
t.Statistic t.test_p.value mean0 mean1 difference
FP001 -4.02204024 6.287404e-05 -2.978465 -2.451471 0.526993515
FP002 10.28672686 1.351580e-23 -2.021347 -3.313860 -1.292512617
FP003 -2.03644225 4.198619e-02 -2.832164 -2.571855 0.260308757
FP004 -4.94895770 9.551772e-07 -3.128380 -2.427428 0.700951689
FP005 10.28247538 1.576549e-23 -1.969000 -3.262722 -1.293722323
FP006 -7.87583806 9.287835e-15 -3.109421 -2.133832 0.975589032
FP007 -0.88733923 3.751398e-01 -2.759967 -2.646185 0.113781971
FP008 3.32843788 9.119521e-04 -2.582652 -2.999613 -0.416960797
FP009 11.49360533 7.467714e-27 -2.249591 -3.926278 -1.676686955
FP010 -4.11392307 4.973603e-05 -2.824302 -2.232824 0.591478647
FP011 -7.01680213 1.067782e-11 -2.934645 -1.927353 1.007292306
FP012 -1.89255407 5.953582e-02 -2.773755 -2.461369 0.312385742
FP013 11.73267872 1.088092e-24 -2.365485 -4.490696 -2.125210704
FP014 11.47456176 1.157457e-23 -2.375401 -4.508431 -2.133030370
FP015 -7.73718733 1.432769e-12 -4.404286 -2.444487 1.959799162
FP016 -0.61719794 5.377695e-01 -2.733559 -2.631007 0.102551919
FP017 2.73915987 6.681864e-03 -2.654607 -3.098613 -0.444006259
FP018 4.26743510 2.806561e-05 -2.643402 -3.215280 -0.571878063
FP019 -2.31045847 2.207143e-02 -2.766910 -2.370603 0.396306731
FP020 -3.44119896 7.251032e-04 -2.785806 -2.224912 0.560894171
FP021 3.35165112 1.009498e-03 -2.642392 -3.272348 -0.629955482
FP022 -0.66772403 5.051252e-01 -2.728040 -2.637071 0.090969199
FP023 2.18958532 2.989162e-02 -2.673106 -3.042650 -0.369544057
FP024 -2.43189276 1.617811e-02 -2.766457 -2.340841 0.425616224
FP025 -2.68651403 7.981132e-03 -2.771677 -2.312545 0.459131121
FP026 0.58596455 5.591541e-01 -2.709082 -2.821875 -0.112793485
FP027 -4.46177875 1.714807e-05 -2.793800 -2.024516 0.769283405
FP028 -3.36478123 1.011310e-03 -2.791941 -2.101089 0.690852068
FP029 1.50309317 1.346711e-01 -2.696475 -2.913093 -0.216617374
FP030 -4.18564626 5.684141e-05 -2.799582 -1.933933 0.865649782
FP031 -0.19030898 8.494207e-01 -2.721986 -2.683765 0.038221437
FP032 -2.86824205 5.100440e-03 -2.757832 -2.224429 0.533403438
FP033 -2.48343886 1.492327e-02 -2.751062 -2.282879 0.468183359
FP034 0.81786492 4.147985e-01 -2.709737 -2.820263 -0.110526015
FP035 4.17698556 6.851675e-05 -2.659660 -3.471594 -0.811934339
FP036 -5.31186085 6.344823e-07 -2.787224 -1.880417 0.906807452
FP037 1.37213471 1.734895e-01 -2.700271 -2.960000 -0.259728507
FP038 -2.55044552 1.224045e-02 -2.764833 -2.228293 0.536540459
FP039 6.83856010 1.396591e-09 -2.588330 -4.332817 -1.744487356
FP040 -4.96957478 3.640553e-06 -2.788036 -1.771692 1.016343810
FP041 3.86443922 2.274448e-04 -2.672424 -3.403833 -0.731409091
FP042 -1.10149897 2.742144e-01 -2.729509 -2.536852 0.192657624
FP043 -0.18525729 8.535189e-01 -2.721284 -2.680317 0.040966323
FP044 15.19844350 1.458342e-22 -2.472237 -6.582105 -4.109868127
FP045 3.26197779 1.781037e-03 -2.678118 -3.403962 -0.725844224
FP046 7.19096539 1.949765e-12 -2.405146 -3.398700 -0.993554071
FP047 3.08813847 2.106659e-03 -2.611605 -3.013676 -0.402071305
FP048 0.78156187 4.354510e-01 -2.703337 -2.826102 -0.122764360
FP049 9.32620107 1.541509e-16 -2.494036 -4.334828 -1.840791658
FP050 1.78989997 7.537562e-02 -2.684810 -2.984860 -0.300049387
FP051 3.85923300 1.590148e-04 -2.656482 -3.224231 -0.567749069
FP052 -1.37622794 1.707261e-01 -2.736296 -2.542529 0.193767561
FP053 7.79872544 3.863769e-12 -2.565418 -4.201910 -1.636492479
FP054 4.71268264 7.815108e-06 -2.656678 -3.474167 -0.817488623
FP055 -2.15047129 3.539774e-02 -2.743122 -2.285294 0.457828105
FP056 6.56517336 8.289424e-09 -2.598841 -4.435323 -1.836481186
FP057 1.55970276 1.207241e-01 -2.686667 -2.952807 -0.266140351
FP058 1.31266618 1.913070e-01 -2.691483 -2.930000 -0.238517200
FP059 5.30327181 1.388228e-06 -2.662258 -3.692115 -1.029857320
FP060 -6.34967826 3.396521e-10 -3.112819 -2.294192 0.818627333
FP061 -3.23528852 1.258017e-03 -2.903859 -2.489247 0.414612257
FP062 -4.68040368 3.284921e-06 -2.978056 -2.384856 0.593200306
FP063 -5.90647947 4.865776e-09 -3.037509 -2.288593 0.748916565
FP064 -3.19849081 1.427257e-03 -2.887640 -2.481616 0.406023478
FP065 13.67947483 7.369864e-39 -1.740827 -3.389468 -1.648641212
FP066 -3.50425986 4.936856e-04 -3.034043 -2.516776 0.517267265
FP067 -3.71025855 2.192910e-04 -2.894797 -2.430554 0.464242594
FP068 -4.50468714 7.534223e-06 -2.923921 -2.356221 0.567699992
FP069 -1.39582672 1.631126e-01 -2.782438 -2.605872 0.176566128
FP070 11.33500604 6.532630e-27 -2.155840 -3.739142 -1.583301881
FP071 9.16039412 1.012284e-18 -2.295828 -3.588521 -1.292692775
FP072 -9.86673490 4.502526e-21 -3.674277 -2.222396 1.451880757
FP073 -6.31556184 4.773987e-10 -2.972104 -2.154780 0.817323998
FP074 -3.16365915 1.617158e-03 -2.849299 -2.446958 0.402341137
FP075 -4.83159241 1.618286e-06 -2.926916 -2.311584 0.615331888
FP076 18.19671006 2.170836e-57 -1.949953 -4.292756 -2.342803359
FP077 -0.24434665 8.070283e-01 -2.728715 -2.697082 0.031633203
FP078 -0.49694487 6.193690e-01 -2.737523 -2.675156 0.062366949
FP079 12.46647477 2.609452e-32 -1.649763 -3.199207 -1.549444605
FP080 -4.44534892 1.029202e-05 -2.896848 -2.308160 0.588687940
FP081 0.11125946 9.114457e-01 -2.714519 -2.729057 -0.014537653
FP082 12.55490234 3.329065e-32 -1.573824 -3.177143 -1.603319328
FP083 -6.28835488 5.760827e-10 -2.932735 -2.149385 0.783350551
FP084 -3.43524930 6.332047e-04 -2.851414 -2.386949 0.464465314
FP085 10.47209331 1.134762e-22 -2.307585 -3.916008 -1.608423485
FP086 1.02088695 3.077271e-01 -2.682101 -2.817578 -0.135477406
FP087 11.07193302 5.850147e-26 -1.684808 -3.107540 -1.422732105
FP088 -4.82078133 1.873320e-06 -2.891398 -2.233960 0.657438003
FP089 15.68684642 7.559612e-42 -2.131606 -4.506936 -2.375330025
FP090 0.72850761 4.666345e-01 -2.693950 -2.792743 -0.098793036
FP091 -1.97821299 4.847758e-02 -2.777626 -2.515187 0.262438593
FP092 12.71461669 9.160201e-31 -2.250250 -4.169957 -1.919706549
FP093 2.40580805 1.652056e-02 -2.636787 -2.972026 -0.335238658
FP094 -1.08529331 2.783195e-01 -2.751874 -2.607909 0.143965054
FP095 -4.83150303 1.885749e-06 -2.863571 -2.203780 0.659791524
FP096 -0.05816460 9.536450e-01 -2.720323 -2.712271 0.008052049
FP097 9.06740092 4.508890e-18 -2.420977 -3.684420 -1.263443027
FP098 -3.09495737 2.088014e-03 -2.820538 -2.391460 0.429077754
FP099 4.51553294 8.153915e-06 -2.575959 -3.203843 -0.627883409
FP100 -4.26730797 2.354655e-05 -2.846430 -2.293727 0.552702276
FP101 -3.33565277 9.211008e-04 -2.828760 -2.363022 0.465738108
FP102 1.25032500 2.119440e-01 -2.683373 -2.857708 -0.174335474
FP103 2.51185846 1.236590e-02 -2.644038 -2.984808 -0.340770007
FP104 1.23433987 2.176989e-01 -2.681746 -2.846934 -0.165188360
FP105 2.56644125 1.063908e-02 -2.640201 -3.003756 -0.363555025
FP106 2.42187970 1.595574e-02 -2.652367 -2.998297 -0.345929993
FP107 10.92623859 2.395320e-23 -2.328707 -4.173284 -1.844576915
FP108 -0.88386799 3.773218e-01 -2.744087 -2.619641 0.124446276
FP109 1.72666429 8.493856e-02 -2.681392 -2.891845 -0.210453156
FP110 -4.30633122 2.083157e-05 -2.839272 -2.253622 0.585649074
FP111 0.07891212 9.371465e-01 -2.716361 -2.727594 -0.011232326
FP112 13.31169435 4.090297e-31 -2.293512 -4.478541 -2.185028791
FP113 -4.25438885 2.743420e-05 -2.842824 -2.207527 0.635296648
FP114 0.38442341 7.009005e-01 -2.711034 -2.759459 -0.048425836
FP115 -0.49398272 6.216320e-01 -2.730653 -2.663059 0.067594185
FP116 -3.39726200 7.657795e-04 -2.815911 -2.310055 0.505856814
FP117 3.16005628 1.769096e-03 -2.623060 -3.157353 -0.534292762
FP118 -3.88255786 1.272871e-04 -2.835755 -2.226776 0.608979252
FP119 -0.71996857 4.720764e-01 -2.734485 -2.636839 0.097646215
FP120 -3.25854728 1.280523e-03 -2.807793 -2.270759 0.537033697
FP121 0.62156119 5.349141e-01 -2.704487 -2.805188 -0.100701417
FP122 -2.44169102 1.530759e-02 -2.781836 -2.396154 0.385682632
FP123 3.52755166 4.929055e-04 -2.628914 -3.165157 -0.536243091
FP124 -3.58983366 3.953044e-04 -2.806888 -2.261494 0.545394825
FP125 -2.91655379 3.853055e-03 -2.786364 -2.350743 0.435620393
FP126 -1.44180023 1.505173e-01 -2.748395 -2.547234 0.201161019
FP127 -2.66597987 8.213408e-03 -2.773386 -2.381429 0.391957737
FP128 -3.37747584 8.536233e-04 -2.794086 -2.284752 0.509334647
FP129 3.28855844 1.192299e-03 -2.642100 -3.193030 -0.550930181
FP130 1.02990587 3.048783e-01 -2.698555 -2.888900 -0.190345358
FP131 -0.49682548 6.198471e-01 -2.727954 -2.653583 0.074370939
FP132 -5.89680424 1.633112e-08 -2.832055 -1.925126 0.906929238
FP133 -1.83896087 6.756107e-02 -2.757100 -2.451750 0.305349880
FP134 3.16620016 1.761695e-03 -2.661506 -3.110000 -0.448493976
FP135 -2.94236705 3.709259e-03 -2.783827 -2.266667 0.517160048
FP136 -2.02006233 4.501990e-02 -2.761938 -2.403304 0.358633451
FP137 -0.07855180 9.374873e-01 -2.720131 -2.706636 0.013494433
FP138 -1.44829927 1.496787e-01 -2.748083 -2.483302 0.264780953
FP139 -0.22212826 8.246439e-01 -2.721936 -2.680897 0.041038417
FP140 -1.86990507 6.355486e-02 -2.758036 -2.403962 0.354073239
FP141 4.15441700 4.792655e-05 -2.650655 -3.232523 -0.581867761
FP142 -2.92307611 4.047862e-03 -2.779233 -2.224519 0.554713355
FP143 0.83414756 4.061300e-01 -2.705904 -2.862338 -0.156433772
FP144 -4.98991305 1.904653e-06 -2.819214 -1.852424 0.966789373
FP145 -3.99831545 1.002597e-04 -2.787077 -2.128990 0.658087566
FP146 6.08904552 1.064009e-08 -2.608687 -3.675000 -1.066313013
FP147 -2.98364059 3.376138e-03 -2.776357 -2.226800 0.549557227
FP148 -4.00444775 1.101041e-04 -2.780300 -2.073012 0.707287491
FP149 9.67498002 8.530838e-16 -2.479225 -5.125930 -2.646704799
FP150 -1.59224059 1.145443e-01 -2.742808 -2.435467 0.307341553
FP151 -1.68674372 9.608846e-02 -2.736013 -2.423019 0.312994495
FP152 2.02103329 4.549820e-02 -2.692325 -3.012308 -0.319982377
FP153 0.83775227 4.044086e-01 -2.703900 -2.892432 -0.188532775
FP154 -0.18701160 8.526043e-01 -2.720525 -2.668889 0.051635701
FP155 4.93743429 3.813516e-06 -2.653412 -3.592273 -0.938860298
FP156 2.70254904 8.178498e-03 -2.685045 -3.160896 -0.475850274
FP157 -1.19798365 2.351567e-01 -2.738105 -2.423220 0.314885042
FP158 -3.18371959 2.293303e-03 -2.757078 -2.039020 0.718058170
FP159 2.90626659 4.444806e-03 -2.687590 -3.127313 -0.439722935
FP160 0.72930617 4.673596e-01 -2.711400 -2.816308 -0.104908144
FP161 -8.02084404 8.158474e-12 -2.826779 -1.193333 1.633445946
FP162 9.05654884 7.502729e-19 -2.147208 -3.300849 -1.153640924
FP163 -4.73411111 2.565152e-06 -3.009759 -2.398455 0.611304290
FP164 11.15556043 6.131703e-27 -1.830706 -3.245042 -1.414335661
FP165 -3.26163144 1.150990e-03 -2.862294 -2.450602 0.411691613
FP166 6.01599552 3.059094e-09 -2.441541 -3.277905 -0.836363881
FP167 -3.77468033 1.718080e-04 -2.874742 -2.398718 0.476023835
FP168 12.78784085 6.302482e-34 -1.659686 -3.250521 -1.590835792
FP169 10.79840624 1.952902e-22 -2.370413 -4.241017 -1.870603512
FP170 1.45059296 1.480425e-01 -2.674961 -2.911943 -0.236981517
FP171 -3.56151646 4.354270e-04 -2.810722 -2.266398 0.544324003
FP172 13.04070659 8.112523e-28 -2.345390 -4.809931 -2.464540221
FP173 2.68918003 7.770466e-03 -2.653554 -3.111556 -0.458001634
FP174 0.94721964 3.446525e-01 -2.699492 -2.845806 -0.146314311
FP175 0.01020115 9.918704e-01 -2.718360 -2.719922 -0.001562215
FP176 -2.29447613 2.298911e-02 -2.766395 -2.374310 0.392084865
FP177 -1.08253877 2.802959e-01 -2.737548 -2.580609 0.156939151
FP178 3.27582610 1.258481e-03 -2.656782 -3.167739 -0.510956834
FP179 0.85670987 3.931634e-01 -2.703846 -2.854409 -0.150562448
FP180 -2.83913345 5.188161e-03 -2.773274 -2.263235 0.510039146
FP181 6.24259165 6.005980e-09 -2.617726 -3.695281 -1.077554681
FP182 -2.11887211 3.595632e-02 -2.755239 -2.384255 0.370983887
FP183 -2.62186301 1.015591e-02 -2.755210 -2.271250 0.483960466
FP184 10.24979020 9.572172e-17 -2.493318 -5.171000 -2.677681975
FP185 3.21519455 1.718715e-03 -2.667230 -3.270000 -0.602770115
FP186 -2.10893733 3.756740e-02 -2.749818 -2.342740 0.407078042
FP187 -0.14233858 8.871705e-01 -2.721122 -2.685942 0.035180420
FP188 -2.76497219 7.083803e-03 -2.760011 -2.153692 0.606318979
FP189 0.29230393 7.707177e-01 -2.713884 -2.774932 -0.061047680
FP190 8.23796541 2.799252e-12 -2.574785 -4.556522 -1.981737159
FP191 -1.62000293 1.089976e-01 -2.742364 -2.404627 0.337737388
FP192 0.55100083 5.833593e-01 -2.711377 -2.829310 -0.117932965
FP193 11.06173597 1.595927e-16 -2.525146 -5.642881 -3.117735616
FP194 -1.03294441 3.047671e-01 -2.728916 -2.553214 0.175701915
FP195 -5.88072667 1.035398e-07 -2.786495 -1.672759 1.113736340
FP196 6.42707826 1.269199e-08 -2.651126 -3.838889 -1.187762913
FP197 3.82944792 3.167065e-04 -2.670555 -3.583800 -0.913245061
FP198 -3.87872401 2.598433e-04 -2.776165 -1.761852 1.014313143
FP199 0.59118217 5.569865e-01 -2.711578 -2.859333 -0.147754967
FP200 5.15622561 3.020793e-06 -2.668319 -3.685106 -1.016787799
FP201 -3.92629512 2.100852e-04 -2.757414 -2.018600 0.738813984
FP202 5.92935333 6.082278e-09 -2.496969 -3.357143 -0.860174019
FP203 1.09341446 2.759667e-01 -2.695582 -2.896147 -0.200564841
FP204 2.86078975 4.868444e-03 -2.672159 -3.141702 -0.469543435
FP205 5.61427744 2.488511e-07 -2.605564 -4.057838 -1.452273414
FP206 3.58353985 6.162975e-04 -2.674519 -3.409474 -0.734954669
FP207 8.34894566 1.153650e-11 -2.595151 -4.768704 -2.173553202
FP208 1.37823055 1.702203e-01 -2.690237 -2.942056 -0.251819108
>
> ## Create a volcano plot
>
> xyplot(-log10(t.test_p.value) ~ difference,
+ data = tests,
+ xlab = "Mean With Structure - Mean Without Structure",
+ ylab = "-log(p-Value)",
+ type = "p")
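Each row of `tests` above comes from a Welch two-sample t-test of solubility split by a 0/1 fingerprint. The core computation is small; a Python sketch with made-up group values (not drawn from the solubility data):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic, the form R's t.test() uses by default."""
    va, vb = variance(a), variance(b)        # sample variances (n-1 denominator)
    se = sqrt(va / len(a) + vb / len(b))     # unpooled standard error
    return (mean(a) - mean(b)) / se

# hypothetical log-solubility values without / with a structural feature
without = [-3.1, -2.8, -3.4, -2.9, -3.0]
with_   = [-2.1, -1.9, -2.4, -2.0, -2.2]
t = welch_t(without, with_)
print(round(t, 2))
```

Because t.test() defaults to the Welch (unequal-variance) form, no pooled variance appears anywhere in the calculation.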
>
> ################################################################################
> ### Section 18.2 Categorical Outcomes
>
> ## Load the segmentation data
>
> data(segmentationData)
> segTrain <- subset(segmentationData, Case == "Train")
> segTrain$Case <- segTrain$Cell <- NULL
>
> segTest <- subset(segmentationData, Case != "Train")
> segTest$Case <- segTest$Cell <- NULL
>
> ## Compute the areas under the ROC curve
> aucVals <- filterVarImp(x = segTrain[, -1], y = segTrain$Class)
> aucVals$Predictor <- rownames(aucVals)
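filterVarImp() with a factor outcome scores each predictor by the area under its ROC curve, and the AUC has a handy rank interpretation: it is the probability that a randomly chosen positive case outscores a randomly chosen negative one (the Mann-Whitney identity). A brute-force Python sketch with toy scores:

```python
def roc_auc(scores, labels):
    """AUC via the rank-sum identity: P(score_pos > score_neg),
    counting ties as 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.2, 0.4, 0.5, 0.6, 0.9]
labels = [0,   0,   1,   0,   1]
print(roc_auc(scores, labels))
```

An AUC of 0.5 means the predictor ranks the two classes no better than chance, which is why the uninformative predictors above sit near 0.5.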
>
> ## Calculate the t-tests as before but with x and y switched
> segTests <- apply(segTrain[, -1], 2,
+ function(x, y)
+ {
+ tStats <- t.test(x ~ y)[c("statistic", "p.value", "estimate")]
+ unlist(tStats)
+ },
+ y = segTrain$Class)
> segTests <- as.data.frame(t(segTests))
> names(segTests) <- c("t.Statistic", "t.test_p.value", "mean0", "mean1")
> segTests$Predictor <- rownames(segTests)
>
> ## Fit a random forest model and get the importance scores
> library(randomForest)
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> set.seed(791)
> rfImp <- randomForest(Class ~ ., data = segTrain,
+ ntree = 2000,
+ importance = TRUE)
> rfValues <- data.frame(RF = importance(rfImp)[, "MeanDecreaseGini"],
+ Predictor = rownames(importance(rfImp)))
>
> ## Now compute the Relief scores
> set.seed(791)
>
> ReliefValues <- attrEval(Class ~ ., data = segTrain,
+ estimator="ReliefFequalK", ReliefIterations = 50)
> ReliefValues <- data.frame(Relief = ReliefValues,
+ Predictor = names(ReliefValues))
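attrEval()'s ReliefF estimators reward a predictor for being similar at nearest neighbors of the same class ("hits") and different at nearest neighbors of the other class ("misses"). A deliberately simplified single-neighbor Relief in Python, on toy data — the real ReliefFequalK averages over k neighbors per class:

```python
import random

def relief(X, y, iterations=100, seed=791):
    """Single-neighbor Relief for a binary outcome (a simplification of
    attrEval's ReliefFequalK)."""
    n, p = len(X), len(X[0])
    span = [max(r[j] for r in X) - min(r[j] for r in X) or 1.0 for j in range(p)]
    def dist(a, b):
        return sum(abs(X[a][j] - X[b][j]) / span[j] for j in range(p))
    rng = random.Random(seed)
    w = [0.0] * p
    for _ in range(iterations):
        i = rng.randrange(n)
        hit = min((k for k in range(n) if k != i and y[k] == y[i]),
                  key=lambda k: dist(i, k))
        miss = min((k for k in range(n) if y[k] != y[i]),
                   key=lambda k: dist(i, k))
        for j in range(p):
            w[j] += (abs(X[i][j] - X[miss][j]) -
                     abs(X[i][j] - X[hit][j])) / (span[j] * iterations)
    return w

# toy data: feature 0 separates the two classes, feature 1 is pure noise
rs = random.Random(0)
y = [0] * 20 + [1] * 20
X = [[cls + rs.gauss(0, 0.1), rs.gauss(0, 1)] for cls in y]
w = relief(X, y)
print(w[0] > w[1])   # the informative feature earns the larger weight
```

Because neighbors are found in the full predictor space, Relief can credit variables that matter only through interactions, unlike the univariate t-test and AUC scores above.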
>
> ## and the MIC statistics
> set.seed(791)
> segMIC <- mine(x = segTrain[, -1],
+ ## Pass the outcome as 0/1
+ y = ifelse(segTrain$Class == "PS", 1, 0))$MIC
> segMIC <- data.frame(Predictor = rownames(segMIC),
+ MIC = segMIC[,1])
>
>
> rankings <- merge(segMIC, ReliefValues)
> rankings <- merge(rankings, rfValues)
> rankings <- merge(rankings, segTests)
> rankings <- merge(rankings, aucVals)
> rankings
Predictor MIC Relief RF t.Statistic
1 AngleCh1 0.131057008 0.002287557 4.730963 -0.21869850
2 AreaCh1 0.108083908 0.016041257 4.315317 -0.93160658
3 AvgIntenCh1 0.292046076 0.071057681 18.865802 -11.75400848
4 AvgIntenCh2 0.329484594 0.150684824 21.857848 -16.09400822
5 AvgIntenCh3 0.135443794 0.018172519 5.135363 -0.14752973
6 AvgIntenCh4 0.166545039 -0.007167866 5.434737 -6.23725001
7 ConvexHullAreaRatioCh1 0.299627157 0.035983697 19.093048 14.22756193
8 ConvexHullPerimRatioCh1 0.254931744 0.041865999 12.624038 -13.86697029
9 DiffIntenDensityCh1 0.239224382 0.038582763 7.335741 -9.81721615
10 DiffIntenDensityCh3 0.133084659 0.010830941 6.647198 1.48785690
11 DiffIntenDensityCh4 0.147643832 0.042352546 5.386981 -5.54840221
12 EntropyIntenCh1 0.261097110 0.129280729 13.867582 -14.04326173
13 EntropyIntenCh3 0.172122729 0.039687246 5.127465 6.94689541
14 EntropyIntenCh4 0.185625627 0.021260676 5.742739 -9.03621024
15 EqCircDiamCh1 0.108083908 0.038820971 4.185607 -1.85186912
16 EqEllipseLWRCh1 0.212579943 0.016550609 5.708705 9.83868863
17 EqEllipseOblateVolCh1 0.122276159 0.010367074 3.906543 1.35616134
18 EqEllipseProlateVolCh1 0.169674904 -0.005386670 6.018121 -1.29243801
19 EqSphereAreaCh1 0.108083908 0.016110539 4.183567 -0.93273061
20 EqSphereVolCh1 0.108083908 0.003440003 4.133475 -0.04348657
21 FiberAlign2Ch3 0.177116842 -0.002628403 4.373886 3.65095007
22 FiberAlign2Ch4 0.149937844 0.016047962 4.868552 2.07009183
23 FiberLengthCh1 0.220505513 0.050610471 8.368712 9.26429955
24 FiberWidthCh1 0.368720274 0.107691201 33.371913 -18.96852051
25 IntenCoocASMCh3 0.196466490 0.024738010 7.298595 -7.95107008
26 IntenCoocASMCh4 0.147981004 0.005574684 3.734085 4.51016239
27 IntenCoocContrastCh3 0.231500707 0.021282305 8.438533 13.20540372
28 IntenCoocContrastCh4 0.135150335 -0.002605380 4.567712 1.02551789
29 IntenCoocEntropyCh3 0.202905819 0.039769279 6.354566 9.62738946
30 IntenCoocEntropyCh4 0.148928924 0.042214966 4.234247 -5.73801017
31 IntenCoocMaxCh3 0.193078547 0.039834486 6.865277 -10.01109754
32 IntenCoocMaxCh4 0.152580596 0.064488810 3.966995 5.02868895
33 KurtIntenCh1 0.200874103 0.003243188 7.095402 3.18226166
34 KurtIntenCh3 0.135694293 0.010944913 4.237905 -2.46783420
35 KurtIntenCh4 0.152775633 0.011328311 5.339427 4.39807449
36 LengthCh1 0.149378763 0.044483732 4.235474 5.28480181
37 NeighborAvgDistCh1 0.123412342 0.023330722 4.266566 -0.46614250
38 NeighborMinDistCh1 0.125623472 0.007850922 5.152365 0.80769702
39 NeighborVarDistCh1 0.124259322 0.016447793 4.286239 0.29886752
40 PerimCh1 0.170013515 0.025272254 4.115593 6.18542523
41 ShapeBFRCh1 0.235667275 0.005194794 9.782458 -13.25311412
42 ShapeLWRCh1 0.183599199 0.029568271 4.745873 8.40241429
43 ShapeP2ACh1 0.332238080 0.073795605 19.362332 14.75801555
44 SkewIntenCh1 0.259680600 0.085229983 13.628434 9.66411304
45 SkewIntenCh3 0.149153858 0.056669970 4.244103 -3.76453794
46 SkewIntenCh4 0.152202895 0.002508761 5.478398 6.46619794
47 SpotFiberCountCh3 0.005721744 -0.005692308 1.793200 -0.53238018
48 SpotFiberCountCh4 0.019496167 -0.015192982 2.948225 2.98634139
49 TotalIntenCh1 0.304429766 0.045548534 20.916993 -8.20041297
50 TotalIntenCh2 0.400952572 0.185416030 41.617068 -14.54087193
51 TotalIntenCh3 0.115771733 0.015068883 5.402005 -0.46828755
52 TotalIntenCh4 0.186643156 0.006071748 5.712561 -5.64791505
53 VarIntenCh1 0.241235863 0.045687478 9.259561 -10.40110966
54 VarIntenCh3 0.150238051 0.002815999 5.176123 -2.44172596
55 VarIntenCh4 0.171222193 0.001547820 5.981325 -4.83455579
56 WidthCh1 0.146204548 0.021560423 5.113884 -1.59227638
57 XCentroid 0.106662637 -0.037877551 4.220162 1.10633278
58 YCentroid 0.119516938 0.055209622 4.908536 2.19081435
t.test_p.value mean0 mean1 PS WS
1 8.269443e-01 9.086539e+01 9.157148e+01 0.5025967 0.5025967
2 3.517830e-01 3.205519e+02 3.329249e+02 0.5709170 0.5709170
3 4.819837e-28 7.702212e+01 2.146922e+02 0.7662375 0.7662375
4 2.530403e-50 1.324405e+02 2.778397e+02 0.7866146 0.7866146
5 8.827553e-01 9.578766e+01 9.671147e+01 0.5214098 0.5214098
6 7.976250e-10 1.168287e+02 1.795797e+02 0.6473814 0.6473814
7 5.895088e-42 1.270408e+00 1.114054e+00 0.7815519 0.7815519
8 4.644231e-40 8.714806e-01 9.310403e-01 0.7547844 0.7547844
9 6.509740e-21 6.055821e+01 9.601373e+01 0.7161591 0.7161591
10 1.371842e-01 7.753072e+01 7.104993e+01 0.5427353 0.5427353
11 4.178896e-08 7.508542e+01 1.061125e+02 0.6294704 0.6294704
12 5.145995e-40 6.364841e+00 7.004622e+00 0.7565169 0.7565169
13 8.836060e-12 5.704662e+00 5.014508e+00 0.6340145 0.6340145
14 9.775620e-19 5.192365e+00 6.023039e+00 0.6661861 0.6661861
15 6.437960e-02 1.940093e+01 2.002646e+01 0.5709170 0.5709170
16 7.218411e-22 2.371177e+00 1.758240e+00 0.6965915 0.6965915
17 1.753561e-01 7.632288e+02 6.866693e+02 0.5045568 0.5045568
18 1.965213e-01 3.543481e+02 3.920429e+02 0.6301870 0.6301870
19 3.512025e-01 1.284179e+03 1.333731e+03 0.5709170 0.5709170
20 9.653226e-01 5.017110e+03 5.033648e+03 0.5709170 0.5709170
21 2.770065e-04 1.479185e+00 1.421565e+00 0.5690728 0.5690728
22 3.873106e-02 1.444148e+00 1.412867e+00 0.5421535 0.5421535
23 1.239044e-19 3.991835e+01 2.819142e+01 0.7007984 0.7007984
24 1.162284e-64 8.691444e+00 1.282684e+01 0.8355127 0.8355127
25 1.067683e-14 7.373161e-02 1.559897e-01 0.6956093 0.6956093
26 7.290850e-06 1.131789e-01 7.724074e-02 0.5878438 0.5878438
27 7.794899e-37 1.163875e+01 6.292079e+00 0.7214199 0.7214199
28 3.053656e-01 7.700191e+00 7.343397e+00 0.5358642 0.5358642
29 1.282007e-20 6.201308e+00 5.216667e+00 0.6891345 0.6891345
30 1.313352e-08 5.545934e+00 6.032306e+00 0.6073356 0.6073356
31 4.418432e-22 1.900393e-01 3.245564e-01 0.6944627 0.6944627
32 5.990072e-07 2.707207e-01 2.131262e-01 0.5892938 0.5892938
33 1.506054e-03 1.208829e+00 3.868323e-01 0.6711982 0.6711982
34 1.388162e-02 3.121647e+00 4.480168e+00 0.5513936 0.5513936
35 1.210957e-05 1.388322e+00 2.421078e-01 0.6046335 0.6046335
36 1.571520e-07 3.237304e+01 2.839838e+01 0.6015142 0.6015142
37 6.412508e-01 2.294382e+02 2.307292e+02 0.5047676 0.5047676
38 4.194740e-01 3.020875e+01 2.962558e+01 0.5018274 0.5018274
39 7.651196e-01 1.046047e+02 1.042038e+02 0.5072546 0.5072546
40 9.075622e-10 9.721959e+01 8.203652e+01 0.6200196 0.6200196
41 6.819382e-37 5.630603e-01 6.406694e-01 0.7319836 0.7319836
42 1.498789e-16 1.968091e+00 1.601640e+00 0.6607778 0.6607778
43 9.265729e-45 2.380621e+00 1.606325e+00 0.7930978 0.7930978
44 6.631564e-21 8.687084e-01 4.124373e-01 0.7253275 0.7253275
45 1.819323e-04 1.429871e+00 1.711829e+00 0.5732881 0.5732881
46 1.592246e-10 1.069003e+00 7.366442e-01 0.6193873 0.6193873
47 5.946089e-01 1.915094e+00 1.970509e+00 0.5173630 0.5173630
48 2.894728e-03 7.224843e+00 6.477212e+00 0.4619775 0.4619775
49 1.624963e-15 2.494150e+04 6.265354e+04 0.7895358 0.7895358
50 3.385024e-43 3.858694e+04 7.665351e+04 0.8012840 0.8012840
51 6.397155e-01 2.685926e+04 2.770986e+04 0.5094972 0.5094972
52 2.290183e-08 3.466429e+04 5.217025e+04 0.6599073 0.6599073
53 5.662429e-23 5.142099e+01 1.136596e+02 0.7322365 0.7322365
54 1.488950e-02 9.519852e+01 1.127093e+02 0.5330821 0.5330821
55 1.632212e-06 1.063653e+02 1.430475e+02 0.6322357 0.6322357
56 1.116486e-01 1.754162e+01 1.813792e+01 0.5799484 0.5799484
57 2.689098e-01 2.698852e+02 2.599759e+02 0.5216669 0.5216669
58 2.875168e-02 1.842972e+02 1.691475e+02 0.5407878 0.5407878
>
> rankings$channel <- "Channel 1"
> rankings$channel[grepl("Ch2$", rankings$Predictor)] <- "Channel 2"
> rankings$channel[grepl("Ch3$", rankings$Predictor)] <- "Channel 3"
> rankings$channel[grepl("Ch4$", rankings$Predictor)] <- "Channel 4"
> rankings$t.Statistic <- abs(rankings$t.Statistic)
>
> splom(~rankings[, c("PS", "t.Statistic", "RF", "Relief", "MIC")],
+ groups = rankings$channel,
+ varnames = c("ROC\nAUC", "Abs\nt-Stat", "Random\nForest", "Relief", "MIC"),
+ auto.key = list(columns = 2))
>
>
> ## Load the grant data. A script to create and save these data is contained
> ## in the same directory as this file.
>
> load("grantData.RData")
>
> dataSubset <- training[pre2008, c("Sponsor62B", "ContractValueBandUnk", "RFCD240302")]
>
> ## This is a simple function to compute several statistics for binary predictors
> tableCalcs <- function(x, y)
+ {
+ tab <- table(x, y)
+ fet <- fisher.test(tab)
+ out <- c(OR = fet$estimate,
+ P = fet$p.value,
+ Gain = attrEval(y ~ x, estimator = "GainRatio"))
+ }
>
> ## lapply() is used to execute the function on each column
> tableResults <- lapply(dataSubset, tableCalcs, y = training[pre2008, "Class"])
>
> ## The results come back as a list of vectors, and "rbind" is used to join
> ## them together as rows of a table
> tableResults <- do.call("rbind", tableResults)
> tableResults
OR.odds ratio P Gain.x
Sponsor62B 6.040826 2.643795e-07 0.0472613504
ContractValueBandUnk 6.294236 1.718209e-263 0.1340764356
RFCD240302 1.097565 8.515664e-01 0.0001664263
>
> ## The permuted Relief scores can be computed using a function from the
> ## AppliedPredictiveModeling package.
>
> permuted <- permuteRelief(x = training[pre2008, c("Sponsor62B", "Day", "NumCI")],
+ y = training[pre2008, "Class"],
+ nperm = 500,
+ ### the remaining options are passed to attrEval()
+ estimator="ReliefFequalK",
+ ReliefIterations= 50)
>
> ## The original Relief scores:
> permuted$observed
Sponsor62B Day NumCI
0.000000000 0.036490637 -0.009047619
>
> ## The number of standard deviations away from the permuted mean:
> permuted$standardized
Sponsor62B Day NumCI
-0.08258544 4.50898453 -1.07569741
>
> ## The distributions of the scores if there were no relationship between the
> ## predictors and outcomes
>
> histogram(~value|Predictor,
+ data = permuted$permutations,
+ xlim = extendrange(permuted$permutations$value),
+ xlab = "Relief Score")
>
>
> ################################################################################
> ### Session Information
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForest_4.6-7 CORElearn_0.9.41
[3] rpart_4.1-1 cluster_1.14.4
[5] minerva_1.3 pROC_1.5.4
[7] plyr_1.8 caret_6.0-22
[9] ggplot2_0.9.3.1 lattice_0.20-15
[11] AppliedPredictiveModeling_1.1-5
loaded via a namespace (and not attached):
[1] car_2.0-16 codetools_0.2-8 colorspace_1.2-1 dichromat_2.0-0
[5] digest_0.6.3 foreach_1.4.0 grid_3.0.1 gtable_0.1.2
[9] iterators_1.0.6 labeling_0.1 MASS_7.3-26 munsell_0.4
[13] parallel_3.0.1 proto_0.3-10 RColorBrewer_1.0-5 reshape2_1.2.2
[17] scales_0.2.3 stringr_0.6.2 tools_3.0.1
>
> q("no")
> proc.time()
user system elapsed
78.161 0.635 79.081
%%R -w 600 -h 600
runChapterScript(18)
## user system elapsed
## 78.161 0.635 79.081
NULL
%%R
### Section 18.1 Numeric Outcomes
## Load the solubility data
library(AppliedPredictiveModeling)
data(solubility)
trainData <- solTrainXtrans
trainData$y <- solTrainY
## keep the continuous predictors and append the outcome to the data frame
SolContPred <- solTrainXtrans[, !grepl("FP", names(solTrainXtrans))]
numSolPred <- ncol(SolContPred)
SolContPred$Sol <- solTrainY
## Get the LOESS smoother and the summary measure
library(caret)
smoother <- filterVarImp(x = SolContPred[, -ncol(SolContPred)],
y = solTrainY,
nonpara = TRUE)
smoother$Predictor <- rownames(smoother)
names(smoother)[1] <- "Smoother"
## Calculate the correlation matrices and keep the columns with the correlations
## between the predictors and the outcome
correlations <- cor(SolContPred)[-(numSolPred+1),(numSolPred+1)]
rankCorrelations <- cor(SolContPred, method = "spearman")[-(numSolPred+1),(numSolPred+1)]
corrs <- data.frame(Predictor = names(SolContPred)[1:numSolPred],
Correlation = correlations,
RankCorrelation = rankCorrelations)
## The maximal information coefficient (MIC) values can be obtained from the
### minerva package:
library(minerva)
MIC <- mine(x = SolContPred[, 1:numSolPred], y = solTrainY)$MIC
MIC <- data.frame(Predictor = rownames(MIC),
MIC = MIC[,1])
## The Relief values for regression can be computed using the CORElearn
## package:
library(CORElearn)
ReliefF <- attrEval(Sol ~ ., data = SolContPred,
estimator = "RReliefFequalK")
ReliefF <- data.frame(Predictor = names(ReliefF),
Relief = ReliefF)
## Combine them all together for a plot
contDescrScores <- merge(smoother, corrs)
contDescrScores <- merge(contDescrScores, MIC)
contDescrScores <- merge(contDescrScores, ReliefF)
rownames(contDescrScores) <- contDescrScores$Predictor
print(
contDescrScores
)
contDescrSplomData <- contDescrScores
contDescrSplomData$Correlation <- abs(contDescrSplomData$Correlation)
contDescrSplomData$RankCorrelation <- abs(contDescrSplomData$RankCorrelation)
contDescrSplomData$Group <- "Other"
contDescrSplomData$Group[grepl("Surface", contDescrSplomData$Predictor)] <- "SA"
Predictor Smoother Correlation RankCorrelation
HydrophilicFactor HydrophilicFactor 0.184455208 0.38598321 0.36469127
MolWeight MolWeight 0.444393085 -0.65852844 -0.68529880
NumAromaticBonds NumAromaticBonds 0.168645461 -0.41066466 -0.45787109
NumAtoms NumAtoms 0.189931478 -0.43581129 -0.51983173
NumBonds NumBonds 0.210717251 -0.45903949 -0.54839850
NumCarbon NumCarbon 0.368196173 -0.60679170 -0.67359114
NumChlorine NumChlorine 0.158529031 -0.39815704 -0.35707519
NumDblBonds NumDblBonds 0.002409996 0.04909171 -0.02042731
NumHalogen NumHalogen 0.157187646 -0.39646897 -0.38111965
NumHydrogen NumHydrogen 0.022654223 -0.15051320 -0.25592586
NumMultBonds NumMultBonds 0.230799468 -0.48041593 -0.47971353
NumNitrogen NumNitrogen 0.026032871 0.16134705 0.10078218
NumNonHAtoms NumNonHAtoms 0.340616555 -0.58362364 -0.62965400
NumNonHBonds NumNonHBonds 0.342455243 -0.58519676 -0.63228366
NumOxygen NumOxygen 0.045245139 0.21270905 0.14954994
NumRings NumRings 0.231183499 -0.48081545 -0.50941815
NumRotBonds NumRotBonds 0.013147325 -0.11466178 -0.14976036
NumSulfer NumSulfer 0.005865198 -0.07658458 -0.12090249
SurfaceArea1 SurfaceArea1 0.192535120 0.30325216 0.19339720
SurfaceArea2 SurfaceArea2 0.216936613 0.26663995 0.14057885
MIC Relief
HydrophilicFactor 0.3208456 0.140185965
MolWeight 0.4679277 0.084734907
NumAromaticBonds 0.2705170 0.050013692
NumAtoms 0.2896815 0.008618179
NumBonds 0.3268683 0.002422405
NumCarbon 0.4434121 0.061605610
NumChlorine 0.2011708 0.023813283
NumDblBonds 0.1688472 0.056997492
NumHalogen 0.2017841 0.045002621
NumHydrogen 0.1939521 0.075626122
NumMultBonds 0.2792600 0.051554380
NumNitrogen 0.1535738 0.168280773
NumNonHAtoms 0.3947092 0.036433860
NumNonHBonds 0.3919627 0.035619406
NumOxygen 0.1527421 0.123797003
NumRings 0.3161828 0.056263469
NumRotBonds 0.1754215 0.043556286
NumSulfer 0.1297052 0.062359034
SurfaceArea1 0.2054896 0.120727945
SurfaceArea2 0.2274047 0.117632188
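The `Smoother` column above comes from `filterVarImp(nonpara = TRUE)`, which scores each continuous predictor by how well a LOESS smoother of the outcome fits it. A hedged sketch of that idea — an R²-style summary of `loess` residuals on toy data; the exact statistic caret reports may differ:

```r
## Sketch of a smoother-based importance score: fit a LOESS curve and
## summarize fit quality with a pseudo-R^2. Toy data, not the solubility set.
set.seed(2)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
fit <- loess(y ~ x)
pseudoR2 <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
pseudoR2
```

A predictor whose smoother tracks the outcome closely gets a pseudo-R² near 1; an unrelated predictor gets a value near 0.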
%%R
print(
featurePlot(solTrainXtrans[, c("NumCarbon", "SurfaceArea2")],
solTrainY,
between = list(x = 1),
type = c("g", "p", "smooth"),
df = 3,
aspect = 1,
labels = c("", "Solubility"))
)
print(
splom(~contDescrSplomData[,c(3, 4, 2, 5)],
groups = contDescrSplomData$Group,
varnames = c("Correlation", "Rank\nCorrelation", "LOESS", "MIC"))
)
%%R
## Now look at the categorical (i.e. binary) predictors
SolCatPred <- solTrainXtrans[, grepl("FP", names(solTrainXtrans))]
SolCatPred$Sol <- solTrainY
numSolCatPred <- ncol(SolCatPred) - 1
tests <- apply(SolCatPred[, 1:numSolCatPred], 2,
function(x, y)
{
tStats <- t.test(y ~ x)[c("statistic", "p.value", "estimate")]
unlist(tStats)
},
y = solTrainY)
## The results are a matrix with predictors in columns; transpose it so that
## the predictors become rows
tests <- as.data.frame(t(tests))
names(tests) <- c("t.Statistic", "t.test_p.value", "mean0", "mean1")
tests$difference <- tests$mean1 - tests$mean0
print(
tests
)
            t.Statistic t.test_p.value     mean0     mean1   difference
FP001       -4.02204024   6.287404e-05 -2.978465 -2.451471  0.526993515
FP002       10.28672686   1.351580e-23 -2.021347 -3.313860 -1.292512617
FP003       -2.03644225   4.198619e-02 -2.832164 -2.571855  0.260308757
FP004       -4.94895770   9.551772e-07 -3.128380 -2.427428  0.700951689
FP005       10.28247538   1.576549e-23 -1.969000 -3.262722 -1.293722323
FP006       -7.87583806   9.287835e-15 -3.109421 -2.133832  0.975589032
FP007       -0.88733923   3.751398e-01 -2.759967 -2.646185  0.113781971
FP008        3.32843788   9.119521e-04 -2.582652 -2.999613 -0.416960797
FP009       11.49360533   7.467714e-27 -2.249591 -3.926278 -1.676686955
FP010       -4.11392307   4.973603e-05 -2.824302 -2.232824  0.591478647
FP011       -7.01680213   1.067782e-11 -2.934645 -1.927353  1.007292306
FP012       -1.89255407   5.953582e-02 -2.773755 -2.461369  0.312385742
...                       (rows FP013-FP207 omitted)
FP208        1.37823055   1.702203e-01 -2.690237 -2.942056 -0.251819108
%%R
## Create a volcano plot
print(
xyplot(-log10(t.test_p.value) ~ difference,
data = tests,
xlab = "Mean With Structure - Mean Without Structure",
ylab = "-log(p-Value)",
type = "p")
)
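The volcano plot displays 208 raw p-values at once, so some small values are expected by chance alone. A minimal sketch of a multiplicity correction using base R's `p.adjust` — this adjustment is an addition, not part of the book's script, and is shown on an illustrative vector; in this notebook the input would be `tests$t.test_p.value`:

```r
## Sketch: false-discovery-rate adjustment of many simultaneous p-values.
## Illustrative vector; here the real input would be tests$t.test_p.value.
pvals <- c(0.001, 0.01, 0.02, 0.8)
p.adjust(pvals, method = "BH")              # Benjamini-Hochberg FDR
sum(p.adjust(pvals, method = "BH") < 0.05)  # predictors kept at FDR 5%
```

The adjusted values could replace the raw ones on the volcano plot's y-axis to make the significance threshold honest under multiple testing.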
%%R
### Section 18.2 Categorical Outcomes
## Load the segmentation data
data(segmentationData)
segTrain <- subset(segmentationData, Case == "Train")
segTrain$Case <- segTrain$Cell <- NULL
segTest <- subset(segmentationData, Case != "Train")
segTest$Case <- segTest$Cell <- NULL
## Compute the areas under the ROC curve
aucVals <- filterVarImp(x = segTrain[, -1], y = segTrain$Class)
aucVals$Predictor <- rownames(aucVals)
## Calculate the t-tests as before, but with x and y switched
segTests <- apply(segTrain[, -1], 2,
function(x, y)
{
tStats <- t.test(x ~ y)[c("statistic", "p.value", "estimate")]
unlist(tStats)
},
y = segTrain$Class)
segTests <- as.data.frame(t(segTests))
names(segTests) <- c("t.Statistic", "t.test_p.value", "mean0", "mean1")
segTests$Predictor <- rownames(segTests)
## Fit a random forest model and get the importance scores
library(randomForest)
set.seed(791)
rfImp <- randomForest(Class ~ ., data = segTrain,
ntree = 2000,
importance = TRUE)
rfValues <- data.frame(RF = importance(rfImp)[, "MeanDecreaseGini"],
Predictor = rownames(importance(rfImp)))
## Now compute the Relief scores
set.seed(791)
ReliefValues <- attrEval(Class ~ ., data = segTrain,
estimator="ReliefFequalK", ReliefIterations = 50)
ReliefValues <- data.frame(Relief = ReliefValues,
Predictor = names(ReliefValues))
## and the MIC statistics
set.seed(791)
segMIC <- mine(x = segTrain[, -1],
## Pass the outcome as 0/1
y = ifelse(segTrain$Class == "PS", 1, 0))$MIC
segMIC <- data.frame(Predictor = rownames(segMIC),
MIC = segMIC[,1])
rankings <- merge(segMIC, ReliefValues)
rankings <- merge(rankings, rfValues)
rankings <- merge(rankings, segTests)
rankings <- merge(rankings, aucVals)
print(
rankings
)
                Predictor         MIC       Relief        RF  t.Statistic
1                AngleCh1 0.131057008  0.002287557  4.730963  -0.21869850
2                 AreaCh1 0.108083908  0.016041257  4.315317  -0.93160658
3             AvgIntenCh1 0.292046076  0.071057681 18.865802 -11.75400848
4             AvgIntenCh2 0.329484594  0.150684824 21.857848 -16.09400822
5             AvgIntenCh3 0.135443794  0.018172519  5.135363  -0.14752973
6             AvgIntenCh4 0.166545039 -0.007167866  5.434737  -6.23725001
...                       (rows 7-58 omitted)
   t.test_p.value        mean0        mean1        PS        WS
1    8.269443e-01 9.086539e+01 9.157148e+01 0.5025967 0.5025967
2    3.517830e-01 3.205519e+02 3.329249e+02 0.5709170 0.5709170
3    4.819837e-28 7.702212e+01 2.146922e+02 0.7662375 0.7662375
4    2.530403e-50 1.324405e+02 2.778397e+02 0.7866146 0.7866146
5    8.827553e-01 9.578766e+01 9.671147e+01 0.5214098 0.5214098
6    7.976250e-10 1.168287e+02 1.795797e+02 0.6473814 0.6473814
...                       (rows 7-58 omitted)
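For a two-class outcome, `filterVarImp()` reports each predictor's area under the ROC curve (the `PS`/`WS` columns in the rankings, identical because the AUC is symmetric between the two classes). A hedged, self-contained sketch of what that number measures, via the Mann-Whitney identity AUC = P(score from class 1 > score from class 0), on toy data — none of the names below come from the segmentation set:

```r
## Sketch: per-predictor ROC AUC via the Mann-Whitney identity.
## Ties between the two classes count as one half.
auc_mw <- function(x, y) {
  x1 <- x[y == levels(y)[2]]   # scores in the second class
  x0 <- x[y == levels(y)[1]]   # scores in the first class
  mean(outer(x1, x0, ">") + 0.5 * outer(x1, x0, "=="))
}
set.seed(1)
x <- c(rnorm(20, 0), rnorm(20, 1))          # toy predictor
y <- factor(rep(c("PS", "WS"), each = 20))  # toy two-class outcome
auc_mw(x, y)
```

A value near 0.5 indicates no discrimination; values near 0 or 1 indicate that the predictor separates the classes well in one direction or the other.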
%%R
rankings$channel <- "Channel 1"
rankings$channel[grepl("Ch2$", rankings$Predictor)] <- "Channel 2"
rankings$channel[grepl("Ch3$", rankings$Predictor)] <- "Channel 3"
rankings$channel[grepl("Ch4$", rankings$Predictor)] <- "Channel 4"
rankings$t.Statistic <- abs(rankings$t.Statistic)
print(
splom(~rankings[, c("PS", "t.Statistic", "RF", "Relief", "MIC")],
groups = rankings$channel,
varnames = c("ROC\nAUC", "Abs\nt-Stat", "Random\nForest", "Relief", "MIC"),
auto.key = list(columns = 2))
)
%%R
## Load the grant data. A script to create and save these data is contained
## in the same directory as this file.
source(file.path(scriptLocation(), "CreateGrantData.R"), echo = TRUE)
load("grantData.RData")
dataSubset <- training[pre2008, c("Sponsor62B", "ContractValueBandUnk", "RFCD240302")]
## This is a simple function to compute several statistics for binary predictors
tableCalcs <- function(x, y) {
  tab <- table(x, y)
  fet <- fisher.test(tab)
  out <- c(OR = fet$estimate,
           P = fet$p.value,
           Gain = attrEval(y ~ x, estimator = "GainRatio"))
  ## return the named vector explicitly
  out
}
## lapply() is used to execute the function on each column
tableResults <- lapply(dataSubset, tableCalcs, y = training[pre2008, "Class"])
## The results come back as a list of vectors, and "rbind" is used to join
## them together as rows of a table
tableResults <- do.call("rbind", tableResults)
print(
tableResults
)
## The permuted Relief scores can be computed using a function from the
## AppliedPredictiveModeling package.
permuted <- permuteRelief(x = training[pre2008, c("Sponsor62B", "Day", "NumCI")],
y = training[pre2008, "Class"],
nperm = 500,
### the remaining options are passed to attrEval()
estimator="ReliefFequalK",
ReliefIterations= 50)
## The original Relief scores:
print(
permuted$observed
)
## The number of standard deviations away from the permuted mean:
print(
permuted$standardized
)
## The distributions of the scores if there were no relationship between the
## predictors and outcomes
print(
histogram(~value|Predictor,
data = permuted$permutations,
xlim = extendrange(permuted$permutations$value),
xlab = "Relief Score")
)
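The logic of permuteRelief() — comparing an observed score with its permutation null — can be sketched in plain base R. Here absolute correlation stands in for the Relief score, an assumption made purely for illustration:

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)                 # x is genuinely related to y

score <- function(x, y) abs(cor(x, y))
observed <- score(x, y)

# Permuting y breaks the relationship; repeat to build the null distribution
perms <- replicate(500, score(x, sample(y)))

# How many SDs the observed score sits above the permuted mean
standardized <- (observed - mean(perms)) / sd(perms)
```

A standardized value well above 2 or so says the observed score is unlikely under the no-relationship null, which is how permuted$standardized above is read.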
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Jo .... [TRUNCATED]
> library(caret)
> library(lubridate)
Attaching package: 'lubridate'
The following object is masked from 'package:plyr':
here
> ## How many cores on the machine should be used for the data
> ## processing. Making cores > 1 will speed things up (depending on your
> ## machine) .... [TRUNCATED]
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'unimelb_training.csv': No such file or directory
Error in file(file, "rt") : cannot open the connection
%%R
showChapterScript(19)
NULL
%%R
showChapterOutput(19)
R Information
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> ################################################################################
> ### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
> ### Copyright 2013 Kuhn and Johnson
> ### Web Page: http://www.appliedpredictivemodeling.com
> ### Contact: Max Kuhn (mxkuhn@gmail.com)
> ###
> ### Chapter 19: An Introduction to Feature Selection
> ###
> ### Required packages: AppliedPredictiveModeling, caret, MASS, corrplot,
> ### RColorBrewer, randomForest, kernlab, klaR,
> ###
> ###
> ### Data used: The Alzheimer disease data from the AppliedPredictiveModeling
> ### package
> ###
> ### Notes:
> ### 1) This code is provided without warranty.
> ###
> ### 2) This code should help the user reproduce the results in the
> ### text. There will be differences between this code and what is in
> ### the computing sections. For example, the computing sections show
> ### how the source functions work (e.g. randomForest() or plsr()),
> ### which were not directly used when creating the book. Also, there may be
> ### syntax differences that occur over time as packages evolve. These files
> ### will reflect those changes.
> ###
> ### 3) In some cases, the calculations in the book were run in
> ### parallel. The sub-processes may reset the random number seed.
> ### Your results may slightly vary.
> ###
> ################################################################################
>
>
>
> ################################################################################
> ### Section 19.6 Case Study: Predicting Cognitive Impairment
>
>
> library(AppliedPredictiveModeling)
> data(AlzheimerDisease)
>
> ## The baseline set of predictors
> bl <- c("Genotype", "age", "tau", "p_tau", "Ab_42", "male")
>
> ## The set of new assays
> newAssays <- colnames(predictors)
> newAssays <- newAssays[!(newAssays %in% c("Class", bl))]
>
> ## Decompose the genotype factor into binary dummy variables
>
> predictors$E2 <- predictors$E3 <- predictors$E4 <- 0
> predictors$E2[grepl("2", predictors$Genotype)] <- 1
> predictors$E3[grepl("3", predictors$Genotype)] <- 1
> predictors$E4[grepl("4", predictors$Genotype)] <- 1
> genotype <- predictors$Genotype
>
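The grepl() decomposition above relies on each Genotype level (e.g. "E2E3") containing the digits of its two alleles. A toy version with hypothetical genotype labels:

```r
geno <- c("E2E3", "E3E3", "E3E4", "E2E4")   # made-up genotype labels

E2 <- as.integer(grepl("2", geno))
E3 <- as.integer(grepl("3", geno))
E4 <- as.integer(grepl("4", geno))

cbind(geno, E2, E3, E4)
```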
> ## Partition the data
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> set.seed(730)
> split <- createDataPartition(diagnosis, p = .8, list = FALSE)
>
> adData <- predictors
> adData$Class <- diagnosis
>
> training <- adData[ split, ]
> testing <- adData[-split, ]
>
> predVars <- names(adData)[!(names(adData) %in% c("Class", "Genotype"))]
>
> ## This summary function is used to evaluate the models.
> fiveStats <- function(...) c(twoClassSummary(...), defaultSummary(...))
>
> ## We create the cross-validation files as a list to use with different
> ## functions
>
> set.seed(104)
> index <- createMultiFolds(training$Class, times = 5)
>
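createMultiFolds() returns a list with one element per fold per repeat, each holding the training (held-in) row indices. A bare-bones sketch of that shape, ignoring the class stratification the real function performs:

```r
n <- 20; k <- 10; times <- 5                        # toy sizes for illustration
index <- list()
for (r in seq_len(times)) {
  folds <- sample(rep(seq_len(k), length.out = n))  # assign rows to folds
  for (f in seq_len(k)) {
    # held-in rows for this resample: everything outside fold f
    index[[sprintf("Fold%02d.Rep%d", f, r)]] <- which(folds != f)
  }
}
length(index)  # k * times resamples
```

Passing the same index list to both rfeControl() and trainControl(), as done below, guarantees every model is evaluated on identical resamples.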
> ## The candidate set of the number of predictors to evaluate
> varSeq <- seq(1, length(predVars)-1, by = 2)
>
> ## We can also use parallel processing to run each resampled RFE
> ## iteration (or resampled model with train()) using different
> ## workers.
>
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(15)
>
>
> ## The rfe() function in the caret package is used for recursive feature
> ## elimination. We set up control functions for this and train() that use
> ## the same cross-validation folds. The 'ctrl' object will be modified several
> ## times as we try different models.
>
> ctrl <- rfeControl(method = "repeatedcv", repeats = 5,
+ saveDetails = TRUE,
+ index = index,
+ returnResamp = "final")
>
> fullCtrl <- trainControl(method = "repeatedcv",
+ repeats = 5,
+ summaryFunction = fiveStats,
+ classProbs = TRUE,
+ index = index)
>
> ## The correlation matrix of the new data
> predCor <- cor(training[, newAssays])
>
> library(RColorBrewer)
> cols <- c(rev(brewer.pal(7, "Blues")),
+ brewer.pal(7, "Reds"))
> library(corrplot)
> corrplot(predCor,
+ order = "hclust",
+ tl.pos = "n",addgrid.col = rgb(1,1,1,.01),
+ col = colorRampPalette(cols)(51))
>
> ## Fit a series of models with the full set of predictors
> set.seed(721)
> rfFull <- train(training[, predVars],
+ training$Class,
+ method = "rf",
+ metric = "ROC",
+ tuneGrid = data.frame(mtry = floor(sqrt(length(predVars)))),
+ ntree = 1000,
+ trControl = fullCtrl)
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
Loading required package: pROC
Loading required package: plyr
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following object is masked from ‘package:stats’:
cov, smooth, var
Loading required package: class
> rfFull
Random Forest
267 samples
132 predictors
2 classes: 'Impaired', 'Control'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 241, 241, 241, 240, 240, 240, ...
Resampling results
ROC Sens Spec Accuracy Kappa ROC SD Sens SD Spec SD Accuracy SD
0.89 0.45 0.985 0.838 0.506 0.0674 0.173 0.0276 0.0508
Kappa SD
0.187
Tuning parameter 'mtry' was held constant at a value of 11
>
> set.seed(721)
> ldaFull <- train(training[, predVars],
+ training$Class,
+ method = "lda",
+ metric = "ROC",
+ ## The 'tol' argument helps lda() know when a matrix is
+ ## singular. One of the predictors has values very close to
+ ## zero, so we lower the value below the default of 1.0e-4.
+ tol = 1.0e-12,
+ trControl = fullCtrl)
Loading required package: MASS
Attaching package: ‘MASS’
The following object is masked _by_ ‘.GlobalEnv’:
genotype
> ldaFull
Linear Discriminant Analysis
267 samples
132 predictors
2 classes: 'Impaired', 'Control'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 241, 241, 241, 240, 240, 240, ...
Resampling results
ROC Sens Spec Accuracy Kappa ROC SD Sens SD Spec SD Accuracy SD
0.844 0.686 0.829 0.79 0.491 0.0859 0.18 0.0819 0.0659
Kappa SD
0.161
>
> set.seed(721)
> svmFull <- train(training[, predVars],
+ training$Class,
+ method = "svmRadial",
+ metric = "ROC",
+ tuneLength = 12,
+ preProc = c("center", "scale"),
+ trControl = fullCtrl)
Loading required package: kernlab
> svmFull
Support Vector Machines with Radial Basis Function Kernel
267 samples
132 predictors
2 classes: 'Impaired', 'Control'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 241, 241, 241, 240, 240, 240, ...
Resampling results across tuning parameters:
C ROC Sens Spec Accuracy Kappa ROC SD Sens SD Spec SD
0.25 0.879 0.725 0.899 0.851 0.625 0.0806 0.157 0.0729
0.5 0.879 0.735 0.896 0.852 0.629 0.0806 0.158 0.0769
1 0.885 0.706 0.923 0.863 0.645 0.0794 0.157 0.0685
2 0.892 0.696 0.933 0.868 0.653 0.0766 0.163 0.0632
4 0.886 0.682 0.931 0.863 0.637 0.0762 0.15 0.0565
8 0.88 0.644 0.927 0.85 0.599 0.0764 0.145 0.0507
16 0.881 0.652 0.923 0.849 0.599 0.076 0.142 0.0516
32 0.881 0.652 0.928 0.853 0.607 0.076 0.142 0.0492
64 0.881 0.644 0.925 0.848 0.596 0.076 0.14 0.0518
128 0.881 0.642 0.921 0.844 0.588 0.076 0.137 0.0556
256 0.881 0.647 0.926 0.85 0.599 0.076 0.145 0.0494
512 0.881 0.644 0.924 0.847 0.593 0.076 0.145 0.0529
Accuracy SD Kappa SD
0.0679 0.167
0.067 0.163
0.0655 0.166
0.0573 0.152
0.0529 0.143
0.0498 0.137
0.0502 0.137
0.0459 0.127
0.0476 0.13
0.0491 0.131
0.0455 0.127
0.0477 0.132
Tuning parameter 'sigma' was held constant at a value of 0.004505826
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.00451 and C = 2.
>
> set.seed(721)
> nbFull <- train(training[, predVars],
+ training$Class,
+ method = "nb",
+ metric = "ROC",
+ trControl = fullCtrl)
Loading required package: klaR
> nbFull
Naive Bayes
267 samples
132 predictors
2 classes: 'Impaired', 'Control'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 241, 241, 241, 240, 240, 240, ...
Resampling results across tuning parameters:
usekernel ROC Sens Spec Accuracy Kappa ROC SD Sens SD Spec SD
FALSE 0.778 0.644 0.78 0.742 0.395 0.107 0.173 0.0931
TRUE 0.798 0.594 0.814 0.753 0.397 0.0952 0.174 0.0971
Accuracy SD Kappa SD
0.0699 0.155
0.0792 0.182
Tuning parameter 'fL' was held constant at a value of 0
ROC was used to select the optimal model using the largest value.
The final values used for the model were fL = 0 and usekernel = TRUE.
>
> lrFull <- train(training[, predVars],
+ training$Class,
+ method = "glm",
+ metric = "ROC",
+ trControl = fullCtrl)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> lrFull
Generalized Linear Model
267 samples
132 predictors
2 classes: 'Impaired', 'Control'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 241, 241, 241, 240, 240, 240, ...
Resampling results
ROC Sens Spec Accuracy Kappa ROC SD Sens SD Spec SD Accuracy SD
0.785 0.67 0.778 0.748 0.417 0.101 0.165 0.11 0.0825
Kappa SD
0.172
>
> set.seed(721)
> knnFull <- train(training[, predVars],
+ training$Class,
+ method = "knn",
+ metric = "ROC",
+ tuneLength = 20,
+ preProc = c("center", "scale"),
+ trControl = fullCtrl)
> knnFull
k-Nearest Neighbors
267 samples
132 predictors
2 classes: 'Impaired', 'Control'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 241, 241, 241, 240, 240, 240, ...
Resampling results across tuning parameters:
k ROC Sens Spec Accuracy Kappa ROC SD Sens SD Spec SD
5 0.753 0.476 0.928 0.804 0.444 0.142 0.184 0.061
7 0.76 0.455 0.94 0.807 0.445 0.136 0.157 0.0585
9 0.788 0.391 0.963 0.806 0.414 0.107 0.157 0.0374
11 0.794 0.369 0.973 0.808 0.408 0.114 0.149 0.0335
13 0.79 0.336 0.967 0.794 0.362 0.14 0.15 0.034
15 0.817 0.328 0.967 0.792 0.353 0.0753 0.152 0.0411
17 0.821 0.298 0.979 0.793 0.338 0.0736 0.157 0.0328
19 0.837 0.282 0.986 0.793 0.328 0.0704 0.168 0.0253
21 0.847 0.265 0.985 0.788 0.307 0.0704 0.169 0.0261
23 0.846 0.248 0.984 0.782 0.292 0.0673 0.121 0.03
25 0.843 0.232 0.987 0.78 0.276 0.073 0.126 0.0229
27 0.846 0.212 0.989 0.776 0.258 0.0669 0.108 0.0216
29 0.849 0.196 0.991 0.773 0.242 0.0687 0.103 0.0201
31 0.847 0.182 0.988 0.767 0.221 0.0703 0.0962 0.0268
33 0.842 0.171 0.99 0.766 0.209 0.0721 0.107 0.0208
35 0.843 0.157 0.991 0.762 0.193 0.0728 0.105 0.0201
37 0.842 0.138 0.991 0.757 0.169 0.0705 0.102 0.02
39 0.837 0.121 0.995 0.756 0.154 0.0731 0.104 0.0158
41 0.831 0.0961 0.995 0.749 0.122 0.0738 0.0932 0.0156
43 0.82 0.0739 0.996 0.744 0.0939 0.107 0.0854 0.0142
Accuracy SD Kappa SD
0.0661 0.195
0.0581 0.166
0.0541 0.177
0.0528 0.174
0.0512 0.177
0.0517 0.178
0.0494 0.183
0.05 0.191
0.0488 0.186
0.0394 0.144
0.0386 0.15
0.0364 0.135
0.0342 0.129
0.0326 0.119
0.0359 0.139
0.0352 0.136
0.0328 0.128
0.0353 0.137
0.0312 0.126
0.0288 0.116
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 29.
>
> ## Now fit the RFE versions. To do this, the 'functions' element of the rfe()
> ## control object is set to the appropriate functions. For more details about
> ## these functions and their arguments, see
> ##
> ## http://caret.r-forge.r-project.org/featureSelection.html
>
>
>
>
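Stripped of resampling and per-subset refits, recursive feature elimination is a loop: score the predictors, drop the weakest, repeat. A minimal base-R sketch, using absolute correlation with a numeric outcome as a stand-in importance measure (caret's rfe() instead uses model-based importances and resamples each subset size):

```r
set.seed(42)
n <- 200
X <- data.frame(matrix(rnorm(n * 6), n, 6))
y <- 2 * X$X1 + X$X2 + rnorm(n)     # only X1 and X2 carry signal

vars <- names(X)
while (length(vars) > 2) {
  imp  <- sapply(vars, function(v) abs(cor(X[[v]], y)))
  vars <- vars[-which.min(imp)]     # eliminate the least important predictor
}
vars
```

The noise columns are eliminated first, leaving the informative predictors; rfe() additionally tracks performance at each subset size to pick how many predictors to keep.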
> ctrl$functions <- rfFuncs
> ctrl$functions$summary <- fiveStats
> set.seed(721)
> rfRFE <- rfe(training[, predVars],
+ training$Class,
+ sizes = varSeq,
+ metric = "ROC",
+ ntree = 1000,
+ rfeControl = ctrl)
> rfRFE
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD
1 0.8067 0.5418 0.8785 0.7867 0.4344 0.09395 0.1856 0.07326
3 0.8590 0.6518 0.9185 0.8457 0.5929 0.08670 0.1705 0.06688
5 0.8872 0.6521 0.9468 0.8661 0.6355 0.08310 0.1743 0.06089
7 0.8870 0.6546 0.9446 0.8652 0.6320 0.11025 0.1929 0.05844
9 0.8985 0.6711 0.9549 0.8771 0.6618 0.07643 0.1890 0.04689
11 0.8956 0.6975 0.9611 0.8886 0.6954 0.10711 0.1722 0.04070
13 0.8996 0.6696 0.9650 0.8839 0.6791 0.10124 0.1659 0.04204
15 0.8964 0.6782 0.9608 0.8832 0.6771 0.10232 0.1881 0.04215
17 0.8994 0.6754 0.9619 0.8832 0.6785 0.07797 0.1706 0.04231
19 0.8965 0.6696 0.9651 0.8840 0.6779 0.07583 0.1823 0.04294
21 0.8978 0.6450 0.9702 0.8810 0.6645 0.07639 0.1824 0.03578
23 0.8965 0.6557 0.9651 0.8803 0.6662 0.07511 0.1791 0.04173
25 0.8958 0.6557 0.9702 0.8841 0.6739 0.07332 0.1786 0.03436
27 0.8965 0.6400 0.9702 0.8796 0.6599 0.07667 0.1851 0.03578
29 0.8979 0.6261 0.9733 0.8781 0.6535 0.07581 0.1809 0.04018
31 0.8974 0.6293 0.9723 0.8781 0.6535 0.07555 0.1869 0.03596
33 0.8930 0.6171 0.9713 0.8743 0.6413 0.07872 0.1890 0.03988
35 0.8920 0.6264 0.9702 0.8758 0.6483 0.07962 0.1832 0.04122
37 0.8937 0.6039 0.9682 0.8682 0.6243 0.07697 0.1925 0.04601
39 0.8939 0.6014 0.9713 0.8697 0.6277 0.07408 0.1859 0.04277
41 0.8925 0.5904 0.9712 0.8668 0.6185 0.09554 0.1772 0.04048
43 0.8908 0.5875 0.9712 0.8660 0.6162 0.09824 0.1721 0.03759
45 0.8972 0.5764 0.9723 0.8637 0.6053 0.07124 0.1903 0.03746
47 0.8944 0.5850 0.9713 0.8651 0.6113 0.07291 0.1964 0.04522
49 0.8958 0.5696 0.9723 0.8616 0.5983 0.07301 0.1975 0.04412
51 0.8955 0.5554 0.9713 0.8570 0.5839 0.07303 0.1924 0.04265
53 0.8924 0.5575 0.9702 0.8570 0.5852 0.07102 0.1884 0.04388
55 0.8935 0.5439 0.9702 0.8532 0.5713 0.07456 0.1904 0.03765
57 0.8929 0.5196 0.9713 0.8472 0.5517 0.07414 0.1831 0.04032
59 0.8937 0.5446 0.9743 0.8562 0.5802 0.07611 0.1796 0.04076
61 0.8925 0.5411 0.9753 0.8561 0.5760 0.07465 0.1976 0.03785
63 0.8908 0.5450 0.9732 0.8556 0.5774 0.07633 0.1877 0.03788
65 0.8951 0.5411 0.9743 0.8554 0.5769 0.07304 0.1798 0.03489
67 0.8965 0.5246 0.9753 0.8517 0.5622 0.07274 0.1834 0.03474
69 0.8957 0.5196 0.9764 0.8510 0.5592 0.07228 0.1859 0.03642
71 0.8931 0.5118 0.9754 0.8481 0.5495 0.07485 0.1854 0.03303
73 0.8907 0.5061 0.9764 0.8473 0.5459 0.07456 0.1851 0.03611
75 0.8951 0.5061 0.9785 0.8488 0.5498 0.07156 0.1823 0.03111
77 0.8920 0.5004 0.9722 0.8427 0.5363 0.07632 0.1682 0.03782
79 0.8923 0.5139 0.9744 0.8481 0.5491 0.07085 0.1950 0.03900
81 0.8943 0.5061 0.9795 0.8496 0.5508 0.07134 0.1868 0.03097
83 0.8932 0.4946 0.9795 0.8465 0.5432 0.07049 0.1630 0.03103
85 0.8927 0.4832 0.9795 0.8435 0.5304 0.07044 0.1722 0.03097
87 0.8925 0.4914 0.9764 0.8436 0.5335 0.07259 0.1742 0.03483
89 0.8923 0.4696 0.9774 0.8383 0.5130 0.07230 0.1757 0.03289
91 0.8916 0.4889 0.9795 0.8450 0.5362 0.07351 0.1702 0.03448
93 0.8929 0.4693 0.9785 0.8391 0.5158 0.07389 0.1686 0.03439
95 0.8901 0.4779 0.9826 0.8443 0.5297 0.07283 0.1749 0.02858
97 0.8918 0.4743 0.9805 0.8420 0.5236 0.07111 0.1713 0.03081
99 0.8942 0.4800 0.9816 0.8443 0.5296 0.07325 0.1797 0.03055
101 0.8930 0.4800 0.9805 0.8435 0.5282 0.07164 0.1792 0.02916
103 0.8924 0.4629 0.9816 0.8397 0.5146 0.06889 0.1634 0.02864
105 0.8918 0.4575 0.9816 0.8382 0.5089 0.07070 0.1712 0.03055
107 0.8918 0.4586 0.9837 0.8398 0.5133 0.06979 0.1700 0.02607
109 0.8942 0.4746 0.9815 0.8428 0.5256 0.06719 0.1710 0.02889
111 0.8914 0.4632 0.9805 0.8390 0.5117 0.07184 0.1786 0.02916
113 0.8928 0.4518 0.9826 0.8375 0.5033 0.07029 0.1803 0.02646
115 0.8935 0.4529 0.9836 0.8383 0.5058 0.06754 0.1793 0.02614
117 0.8933 0.4464 0.9815 0.8352 0.4978 0.06930 0.1652 0.02687
119 0.8936 0.4657 0.9857 0.8435 0.5246 0.06720 0.1615 0.02522
121 0.8891 0.4682 0.9816 0.8412 0.5190 0.07080 0.1736 0.02864
123 0.8926 0.4418 0.9837 0.8353 0.4965 0.06742 0.1684 0.02796
125 0.8894 0.4436 0.9847 0.8367 0.4987 0.06941 0.1764 0.02571
127 0.8936 0.4518 0.9847 0.8390 0.5081 0.06928 0.1708 0.02571
129 0.8889 0.4468 0.9836 0.8367 0.5003 0.06845 0.1749 0.02614
131 0.8934 0.4346 0.9847 0.8344 0.4912 0.07038 0.1649 0.02571
132 0.8877 0.4379 0.9847 0.8352 0.4933 0.07298 0.1726 0.02571
AccuracySD KappaSD Selected
0.06711 0.1847
0.06934 0.1855
0.05988 0.1679
0.06634 0.1892
0.06103 0.1797
0.05451 0.1591
0.05200 0.1546 *
0.05678 0.1700
0.05461 0.1627
0.05903 0.1749
0.05494 0.1723
0.05715 0.1724
0.05202 0.1642
0.05537 0.1717
0.05820 0.1790
0.05602 0.1765
0.05933 0.1855
0.05394 0.1683
0.05929 0.1837
0.06028 0.1882
0.05597 0.1763
0.05423 0.1704
0.05549 0.1842
0.06042 0.1929
0.05635 0.1855
0.05377 0.1800
0.05516 0.1801
0.05398 0.1891
0.05712 0.1893
0.04935 0.1670
0.05440 0.1919
0.05082 0.1779
0.05206 0.1765
0.05270 0.1842
0.05400 0.1842
0.05191 0.1838
0.05377 0.1906
0.05491 0.1918
0.05021 0.1690
0.05564 0.1955
0.05507 0.1950
0.04912 0.1670
0.04941 0.1777
0.05320 0.1855
0.04916 0.1798
0.05029 0.1790
0.04891 0.1745
0.04844 0.1775
0.05162 0.1851
0.05033 0.1850
0.05291 0.1895
0.04500 0.1684
0.05046 0.1808
0.05023 0.1786
0.04958 0.1780
0.05316 0.1929
0.05182 0.1923
0.05036 0.1882
0.04590 0.1745
0.04543 0.1677
0.05059 0.1828
0.04866 0.1791
0.05019 0.1903
0.05073 0.1864
0.04952 0.1862
0.04700 0.1795
0.04872 0.1865
The top 5 variables (out of 13):
Ab_42, tau, p_tau, VEGF, FAS
>
> ctrl$functions <- ldaFuncs
> ctrl$functions$summary <- fiveStats
>
> set.seed(721)
> ldaRFE <- rfe(training[, predVars],
+ training$Class,
+ sizes = varSeq,
+ metric = "ROC",
+ tol = 1.0e-12,
+ rfeControl = ctrl)
> ldaRFE
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD
1 0.8483 0.6621 0.8795 0.8201 0.5385 0.08787 0.2009 0.07003
3 0.8518 0.6243 0.8899 0.8172 0.5208 0.08546 0.2032 0.06947
5 0.8509 0.6211 0.8979 0.8216 0.5278 0.08457 0.2038 0.06042
7 0.8517 0.6154 0.9053 0.8255 0.5341 0.08337 0.2241 0.07148
9 0.8513 0.6264 0.9043 0.8278 0.5424 0.08574 0.2168 0.06818
11 0.8566 0.6318 0.9176 0.8391 0.5676 0.08869 0.2132 0.06175
13 0.8818 0.6736 0.9311 0.8603 0.6256 0.08136 0.2104 0.06389
15 0.8872 0.6779 0.9311 0.8617 0.6298 0.07954 0.1970 0.06091
17 0.8900 0.6729 0.9215 0.8532 0.6129 0.07704 0.1885 0.07138
19 0.8975 0.7050 0.9289 0.8676 0.6494 0.07703 0.1940 0.05814
21 0.9004 0.7050 0.9299 0.8683 0.6503 0.07364 0.1971 0.05628
23 0.9067 0.7125 0.9289 0.8698 0.6536 0.07214 0.2127 0.05903
25 0.9109 0.7193 0.9279 0.8708 0.6589 0.06827 0.2026 0.05995
27 0.9104 0.7350 0.9271 0.8745 0.6720 0.06855 0.1866 0.05694
29 0.9128 0.7404 0.9322 0.8798 0.6846 0.06828 0.1834 0.05535
31 0.9128 0.7346 0.9217 0.8706 0.6632 0.06917 0.1819 0.05352
33 0.9157 0.7429 0.9279 0.8774 0.6790 0.06941 0.1854 0.05217
35 0.9163 0.7407 0.9217 0.8721 0.6678 0.06746 0.1848 0.05660
37 0.9131 0.7436 0.9187 0.8706 0.6654 0.06615 0.1861 0.05812
39 0.9126 0.7461 0.9187 0.8714 0.6679 0.06456 0.1853 0.05899
41 0.9149 0.7436 0.9155 0.8684 0.6610 0.06764 0.1843 0.06073
43 0.9131 0.7486 0.9145 0.8691 0.6630 0.06749 0.1872 0.05956
45 0.9145 0.7539 0.9094 0.8669 0.6606 0.06560 0.1719 0.05729
47 0.9109 0.7411 0.9011 0.8572 0.6369 0.06528 0.1747 0.05511
49 0.9119 0.7471 0.9021 0.8595 0.6426 0.06766 0.1817 0.05519
51 0.9110 0.7471 0.9031 0.8601 0.6430 0.06583 0.1885 0.05267
53 0.9098 0.7443 0.9043 0.8601 0.6427 0.06406 0.1934 0.06022
55 0.9082 0.7300 0.9012 0.8541 0.6261 0.06495 0.1950 0.05753
57 0.9075 0.7350 0.9054 0.8586 0.6367 0.06390 0.1997 0.06148
59 0.9056 0.7357 0.9115 0.8632 0.6464 0.06710 0.1977 0.05784
61 0.9082 0.7357 0.9095 0.8617 0.6448 0.06461 0.1885 0.06244
63 0.9087 0.7300 0.9065 0.8579 0.6364 0.06374 0.1890 0.06966
65 0.9073 0.7364 0.9036 0.8573 0.6360 0.06500 0.1967 0.06857
67 0.9043 0.7411 0.9045 0.8595 0.6429 0.06666 0.1847 0.06917
69 0.8989 0.7414 0.9005 0.8566 0.6363 0.07321 0.1916 0.07001
71 0.8989 0.7386 0.9003 0.8557 0.6332 0.07140 0.1973 0.07053
73 0.8980 0.7332 0.9003 0.8542 0.6301 0.07119 0.1840 0.06976
75 0.8954 0.7354 0.8953 0.8514 0.6275 0.07105 0.1649 0.07786
77 0.8931 0.7354 0.8899 0.8475 0.6193 0.07323 0.1623 0.07480
79 0.8911 0.7461 0.8818 0.8445 0.6163 0.07300 0.1430 0.07030
81 0.8878 0.7489 0.8848 0.8474 0.6235 0.06987 0.1453 0.07379
83 0.8856 0.7382 0.8733 0.8360 0.5990 0.06906 0.1441 0.08200
85 0.8836 0.7350 0.8766 0.8376 0.6003 0.07030 0.1485 0.07922
87 0.8825 0.7296 0.8766 0.8362 0.5961 0.07112 0.1482 0.07726
89 0.8831 0.7189 0.8726 0.8304 0.5801 0.07049 0.1561 0.07203
91 0.8813 0.7293 0.8694 0.8310 0.5855 0.07322 0.1527 0.07665
93 0.8778 0.7236 0.8755 0.8340 0.5893 0.07479 0.1585 0.07555
95 0.8749 0.7339 0.8745 0.8362 0.5961 0.09342 0.1570 0.07480
97 0.8827 0.7282 0.8743 0.8345 0.5922 0.07452 0.1570 0.07919
99 0.8822 0.7371 0.8733 0.8362 0.5959 0.07307 0.1522 0.06643
101 0.8843 0.7196 0.8765 0.8339 0.5853 0.07526 0.1620 0.06258
103 0.8808 0.7164 0.8693 0.8278 0.5714 0.07495 0.1717 0.06534
105 0.8787 0.7318 0.8672 0.8301 0.5805 0.07423 0.1651 0.06133
107 0.8746 0.7096 0.8682 0.8249 0.5651 0.07805 0.1584 0.06424
109 0.8679 0.7036 0.8673 0.8227 0.5589 0.09389 0.1616 0.06383
111 0.8688 0.7064 0.8702 0.8257 0.5653 0.07951 0.1644 0.06510
113 0.8635 0.7182 0.8652 0.8251 0.5678 0.08714 0.1687 0.06935
115 0.8623 0.6993 0.8577 0.8145 0.5415 0.09186 0.1710 0.06934
117 0.8586 0.6968 0.8516 0.8093 0.5307 0.09189 0.1724 0.06903
119 0.8570 0.6979 0.8518 0.8099 0.5323 0.09064 0.1825 0.07745
121 0.8581 0.7093 0.8508 0.8121 0.5403 0.08832 0.1768 0.07823
123 0.8559 0.6957 0.8477 0.8064 0.5241 0.08573 0.1852 0.07355
125 0.8507 0.6907 0.8404 0.7996 0.5096 0.09223 0.1859 0.07252
127 0.8439 0.6771 0.8405 0.7959 0.4979 0.08763 0.1894 0.07237
129 0.8418 0.6739 0.8313 0.7883 0.4827 0.08636 0.1879 0.07310
131 0.8439 0.6857 0.8294 0.7900 0.4910 0.08593 0.1803 0.08189
132 0.8439 0.6857 0.8294 0.7900 0.4910 0.08593 0.1803 0.08189
AccuracySD KappaSD Selected
0.06212 0.1693
0.06092 0.1694
0.05878 0.1723
0.07356 0.2082
0.06843 0.1960
0.06755 0.1931
0.07173 0.1998
0.06132 0.1738
0.06709 0.1779
0.06469 0.1773
0.05814 0.1630
0.06293 0.1778
0.06107 0.1678
0.05941 0.1600
0.05484 0.1485
0.05414 0.1480
0.05343 0.1499
0.05565 0.1534 *
0.05921 0.1614
0.05946 0.1604
0.05870 0.1577
0.05789 0.1585
0.05406 0.1445
0.05679 0.1521
0.05793 0.1556
0.05796 0.1586
0.05942 0.1618
0.06066 0.1687
0.06102 0.1704
0.05765 0.1624
0.05590 0.1549
0.05753 0.1524
0.06565 0.1722
0.06434 0.1671
0.06391 0.1670
0.06698 0.1765
0.06320 0.1639
0.06074 0.1485
0.06176 0.1516
0.05911 0.1417
0.06292 0.1510
0.06461 0.1510
0.05977 0.1408
0.06272 0.1476
0.06231 0.1524
0.06917 0.1646
0.06792 0.1640
0.06741 0.1630
0.07049 0.1658
0.06348 0.1552
0.06082 0.1547
0.06131 0.1588
0.05867 0.1505
0.05767 0.1460
0.06143 0.1582
0.05859 0.1501
0.06193 0.1546
0.06218 0.1573
0.06459 0.1642
0.07084 0.1784
0.06828 0.1704
0.07060 0.1793
0.06999 0.1790
0.06957 0.1782
0.07027 0.1782
0.06589 0.1611
0.06589 0.1611
The top 5 variables (out of 35):
Ab_42, tau, p_tau, MMP10, MIF
>
> ctrl$functions <- nbFuncs
> ctrl$functions$summary <- fiveStats
> set.seed(721)
> nbRFE <- rfe(training[, predVars],
+ training$Class,
+ sizes = varSeq,
+ metric = "ROC",
+ rfeControl = ctrl)
> nbRFE
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD
1 0.8219 0.6286 0.8806 0.8112 0.5133 0.09390 0.1858 0.07246
3 0.8260 0.6171 0.8537 0.7886 0.4655 0.08952 0.1996 0.08506
5 0.8176 0.6200 0.8374 0.7774 0.4472 0.08568 0.1868 0.08760
7 0.8171 0.6107 0.8355 0.7737 0.4368 0.08333 0.1784 0.08128
9 0.8152 0.6093 0.8274 0.7672 0.4248 0.08766 0.1798 0.08402
11 0.8197 0.6143 0.8325 0.7723 0.4370 0.08881 0.1644 0.08098
13 0.8264 0.6532 0.8348 0.7845 0.4720 0.08559 0.1782 0.08413
15 0.8274 0.6582 0.8325 0.7844 0.4725 0.08184 0.1732 0.07366
17 0.8318 0.6807 0.8387 0.7950 0.5000 0.08452 0.1690 0.07622
19 0.8314 0.6671 0.8437 0.7948 0.4955 0.08804 0.1804 0.08144
21 0.8294 0.6589 0.8426 0.7918 0.4866 0.08704 0.1847 0.08180
23 0.8275 0.6457 0.8457 0.7904 0.4788 0.09091 0.1952 0.08045
25 0.8280 0.6436 0.8404 0.7859 0.4697 0.09197 0.1937 0.07888
27 0.8307 0.6436 0.8456 0.7896 0.4766 0.09182 0.1942 0.07845
29 0.8291 0.6300 0.8446 0.7852 0.4643 0.09237 0.1952 0.08216
31 0.8229 0.6182 0.8416 0.7799 0.4508 0.09497 0.1859 0.08083
33 0.8222 0.6182 0.8386 0.7777 0.4481 0.08826 0.1859 0.08690
35 0.8185 0.6264 0.8345 0.7769 0.4487 0.09244 0.1806 0.08364
37 0.8165 0.6243 0.8344 0.7761 0.4454 0.09084 0.1894 0.08191
39 0.8147 0.6214 0.8324 0.7740 0.4414 0.09174 0.1928 0.08403
41 0.8113 0.6139 0.8244 0.7659 0.4251 0.09145 0.1896 0.08912
43 0.8106 0.6111 0.8264 0.7667 0.4251 0.08928 0.1869 0.08561
45 0.8078 0.6025 0.8212 0.7606 0.4105 0.09236 0.1962 0.08997
47 0.8031 0.5971 0.8191 0.7576 0.4035 0.09325 0.1960 0.09122
49 0.8006 0.6021 0.8169 0.7574 0.4048 0.09371 0.1918 0.08948
51 0.7942 0.5993 0.8096 0.7514 0.3923 0.09367 0.1954 0.08918
53 0.7942 0.6021 0.8067 0.7500 0.3922 0.09352 0.1929 0.09279
55 0.7924 0.6025 0.8047 0.7486 0.3897 0.09154 0.1962 0.09277
57 0.7910 0.5968 0.8037 0.7463 0.3835 0.09229 0.1991 0.09369
59 0.7905 0.5939 0.8016 0.7441 0.3782 0.09206 0.1984 0.09330
61 0.7885 0.6054 0.8005 0.7463 0.3872 0.09605 0.1925 0.09374
63 0.7856 0.6025 0.8035 0.7477 0.3882 0.09639 0.1929 0.09137
65 0.7853 0.5993 0.7953 0.7409 0.3750 0.09680 0.1954 0.09268
67 0.7839 0.5996 0.7984 0.7432 0.3796 0.09714 0.1934 0.09261
69 0.7824 0.5943 0.7994 0.7425 0.3760 0.09728 0.1898 0.09267
71 0.7787 0.5996 0.7973 0.7425 0.3792 0.10154 0.1845 0.09806
73 0.7791 0.6025 0.7973 0.7432 0.3809 0.10094 0.1851 0.09763
75 0.7794 0.5996 0.7942 0.7402 0.3745 0.10232 0.1901 0.09662
77 0.7792 0.6018 0.7973 0.7432 0.3811 0.10076 0.1810 0.09632
79 0.7786 0.6100 0.7972 0.7453 0.3875 0.10145 0.1815 0.09363
81 0.7783 0.6150 0.7973 0.7469 0.3928 0.10362 0.1801 0.09841
83 0.7785 0.6100 0.7953 0.7440 0.3859 0.10308 0.1833 0.09937
85 0.7799 0.6043 0.7953 0.7424 0.3807 0.10384 0.1844 0.09601
87 0.7798 0.6096 0.7984 0.7462 0.3895 0.10427 0.1826 0.09587
89 0.7796 0.6096 0.7953 0.7439 0.3859 0.10255 0.1826 0.09853
91 0.7803 0.6043 0.7974 0.7439 0.3838 0.10025 0.1850 0.09901
93 0.7813 0.6071 0.8025 0.7484 0.3926 0.09988 0.1861 0.09907
95 0.7819 0.6046 0.8014 0.7469 0.3886 0.09946 0.1900 0.09792
97 0.7841 0.5989 0.8025 0.7462 0.3852 0.09933 0.1877 0.09702
99 0.7844 0.5986 0.8025 0.7462 0.3862 0.09856 0.1808 0.09932
101 0.7856 0.5982 0.8066 0.7492 0.3906 0.09764 0.1842 0.09638
103 0.7865 0.6007 0.8066 0.7499 0.3934 0.09868 0.1808 0.09638
105 0.7880 0.6032 0.8097 0.7529 0.3997 0.09868 0.1785 0.09609
107 0.7881 0.5982 0.8046 0.7476 0.3880 0.09737 0.1876 0.09781
109 0.7909 0.5954 0.8026 0.7454 0.3825 0.09565 0.1869 0.09620
111 0.7898 0.5929 0.8036 0.7454 0.3814 0.09557 0.1885 0.09590
113 0.7914 0.5954 0.8057 0.7476 0.3865 0.09535 0.1864 0.09604
115 0.7939 0.5982 0.8108 0.7522 0.3961 0.09499 0.1826 0.09499
117 0.7969 0.5986 0.8118 0.7530 0.3980 0.09534 0.1855 0.09748
119 0.7948 0.6039 0.8108 0.7537 0.4019 0.09458 0.1827 0.09978
121 0.7962 0.5986 0.8118 0.7529 0.3990 0.09327 0.1725 0.09878
123 0.7993 0.5986 0.8108 0.7522 0.3978 0.09368 0.1725 0.09916
125 0.7999 0.6039 0.8108 0.7537 0.4020 0.09421 0.1733 0.09732
127 0.7987 0.6014 0.8118 0.7538 0.4014 0.09424 0.1683 0.09737
129 0.7968 0.6014 0.8108 0.7530 0.4001 0.09664 0.1683 0.09732
131 0.7980 0.5936 0.8139 0.7530 0.3966 0.09522 0.1742 0.09706
132 0.7980 0.5936 0.8139 0.7530 0.3966 0.09522 0.1742 0.09706
AccuracySD KappaSD Selected
0.05981 0.1591
0.07148 0.1822
0.06939 0.1721
0.06715 0.1668
0.06916 0.1679
0.06712 0.1629
0.07135 0.1725
0.07058 0.1762
0.06595 0.1587 *
0.07284 0.1747
0.07193 0.1740
0.07296 0.1821
0.07436 0.1849
0.07286 0.1808
0.07526 0.1843
0.07277 0.1781
0.07880 0.1876
0.07058 0.1718
0.07097 0.1773
0.08000 0.1952
0.07929 0.1885
0.07735 0.1855
0.07910 0.1935
0.08068 0.1953
0.07778 0.1875
0.07609 0.1850
0.07929 0.1904
0.08009 0.1924
0.08398 0.2005
0.08018 0.1921
0.08105 0.1910
0.07906 0.1878
0.07920 0.1881
0.08124 0.1907
0.07596 0.1770
0.07690 0.1735
0.07299 0.1650
0.07609 0.1751
0.07310 0.1658
0.07230 0.1677
0.07457 0.1679
0.07593 0.1697
0.07698 0.1750
0.07872 0.1803
0.07928 0.1801
0.07925 0.1805
0.07853 0.1784
0.08007 0.1844
0.08021 0.1863
0.08105 0.1845
0.08156 0.1900
0.08351 0.1923
0.08267 0.1898
0.08395 0.1959
0.08036 0.1882
0.08116 0.1892
0.08057 0.1877
0.08211 0.1924
0.08577 0.2000
0.08526 0.1957
0.08264 0.1883
0.08220 0.1868
0.08051 0.1835
0.07932 0.1797
0.07973 0.1809
0.07924 0.1821
0.07924 0.1821
The top 5 variables (out of 17):
Ab_42, tau, p_tau, MMP10, MIF
>
> ## Here, the caretFuncs list allows a model to be tuned at each iteration
> ## of feature selection.
>
> ctrl$functions <- caretFuncs
> ctrl$functions$summary <- fiveStats
>
> ## This option tells train() to run its model tuning
> ## sequentially. Otherwise, there would be parallel processing at two
> ## levels, which is possible but requires W^2 workers. On our machine,
> ## it was more efficient to run only the RFE process in parallel.
>
> cvCtrl <- trainControl(method = "cv",
+ verboseIter = FALSE,
+ classProbs = TRUE,
+ allowParallel = FALSE)
>
> set.seed(721)
> svmRFE <- rfe(training[, predVars],
+ training$Class,
+ sizes = varSeq,
+ rfeControl = ctrl,
+ metric = "ROC",
+ ## Now arguments to train() are used.
+ method = "svmRadial",
+ tuneLength = 12,
+ preProc = c("center", "scale"),
+ trControl = cvCtrl)
> svmRFE
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD
1 0.5905 0.000000 0.9958 0.7237 -0.005263 0.09085 0.00000 0.023393
3 0.6070 0.005357 0.9927 0.7229 -0.002947 0.08997 0.02657 0.018319
5 0.5768 0.000000 0.9979 0.7252 -0.002887 0.09124 0.00000 0.010418
7 0.5708 0.002500 0.9989 0.7267 0.001849 0.09078 0.01768 0.007443
9 0.6014 0.000000 0.9969 0.7245 -0.004216 0.09225 0.00000 0.012209
11 0.6174 0.002500 0.9917 0.7215 -0.007576 0.08716 0.01768 0.026643
13 0.6005 0.000000 0.9959 0.7237 -0.005572 0.07947 0.00000 0.013886
15 0.6089 0.008571 0.9806 0.7148 -0.014105 0.10215 0.03427 0.038157
17 0.6058 0.013929 0.9826 0.7178 -0.004414 0.10541 0.04227 0.035314
19 0.6158 0.005714 0.9929 0.7230 -0.002504 0.08359 0.04041 0.022820
21 0.6314 0.005357 0.9855 0.7177 -0.012255 0.08684 0.02657 0.031549
23 0.6417 0.000000 0.9897 0.7192 -0.013806 0.10571 0.00000 0.025491
25 0.6309 0.029286 0.9846 0.7237 0.018380 0.09551 0.07095 0.040661
27 0.6458 0.023929 0.9919 0.7275 0.019326 0.09345 0.08064 0.021203
29 0.6488 0.054286 0.9859 0.7314 0.053679 0.09500 0.07627 0.030503
31 0.6480 0.053929 0.9809 0.7276 0.046498 0.09630 0.07589 0.042807
33 0.6725 0.062143 0.9601 0.7148 0.029185 0.09998 0.08970 0.059497
35 0.7088 0.153571 0.9467 0.7298 0.123175 0.10956 0.12796 0.061985
37 0.7022 0.168929 0.9486 0.7356 0.148679 0.10998 0.10407 0.061053
39 0.7493 0.326786 0.9245 0.7610 0.291271 0.10438 0.14757 0.065438
41 0.7538 0.334643 0.9267 0.7650 0.304769 0.09439 0.12942 0.057705
43 0.7654 0.382500 0.9164 0.7703 0.336838 0.10869 0.16560 0.074757
45 0.7888 0.448571 0.9062 0.7814 0.388675 0.08715 0.16468 0.077762
47 0.7972 0.468929 0.9022 0.7839 0.402476 0.08719 0.16326 0.072602
49 0.8024 0.452857 0.9137 0.7878 0.400875 0.08017 0.17391 0.067174
51 0.8041 0.464643 0.9136 0.7909 0.415948 0.08680 0.15094 0.067200
53 0.7975 0.452857 0.9127 0.7870 0.401976 0.08324 0.16405 0.067241
55 0.7853 0.413214 0.9009 0.7680 0.348189 0.08213 0.14752 0.077139
57 0.7844 0.438929 0.9135 0.7837 0.390411 0.07783 0.15230 0.070724
59 0.7870 0.415714 0.8961 0.7644 0.337122 0.07936 0.18739 0.072832
61 0.7980 0.450714 0.9074 0.7824 0.388683 0.08052 0.17584 0.065940
63 0.7826 0.421786 0.9082 0.7748 0.361639 0.08169 0.17747 0.063917
65 0.7864 0.443929 0.9002 0.7751 0.372336 0.08733 0.18343 0.070610
67 0.7906 0.460000 0.8948 0.7756 0.384999 0.08865 0.14952 0.076753
69 0.7866 0.434286 0.9051 0.7763 0.371835 0.08833 0.16216 0.064338
71 0.7906 0.456071 0.9035 0.7810 0.392684 0.09026 0.14814 0.064754
73 0.7866 0.418929 0.9075 0.7736 0.357854 0.09090 0.17686 0.065723
75 0.7833 0.429286 0.9002 0.7711 0.361158 0.09241 0.14791 0.064189
77 0.7918 0.420714 0.9021 0.7706 0.354054 0.09031 0.16442 0.061460
79 0.7923 0.432500 0.9053 0.7759 0.369002 0.09370 0.16919 0.062353
81 0.8004 0.471071 0.8983 0.7817 0.397339 0.08313 0.16998 0.064611
83 0.8019 0.465357 0.8994 0.7808 0.392492 0.09808 0.17318 0.066776
85 0.8168 0.515357 0.8972 0.7929 0.436919 0.07805 0.16860 0.064093
87 0.8209 0.498214 0.8983 0.7892 0.423366 0.07254 0.16296 0.061885
89 0.8242 0.512857 0.8982 0.7930 0.435858 0.07735 0.18215 0.063687
91 0.8274 0.502500 0.8973 0.7893 0.425311 0.07538 0.17414 0.062171
93 0.8262 0.497857 0.9063 0.7944 0.432279 0.08008 0.17617 0.054245
95 0.8206 0.497143 0.9064 0.7945 0.434122 0.07848 0.15594 0.055060
97 0.8232 0.488929 0.9105 0.7950 0.430581 0.07814 0.17154 0.055436
99 0.8223 0.500000 0.9075 0.7959 0.437668 0.07680 0.16779 0.062664
101 0.8218 0.504286 0.9054 0.7958 0.436402 0.07946 0.18323 0.059794
103 0.8279 0.536429 0.9085 0.8063 0.471852 0.08184 0.18360 0.064848
105 0.8267 0.543571 0.9014 0.8034 0.470228 0.08120 0.16639 0.065705
107 0.8251 0.541071 0.9064 0.8063 0.472569 0.07694 0.18172 0.059117
109 0.8268 0.551429 0.9034 0.8071 0.480250 0.07694 0.16333 0.063430
111 0.8179 0.527143 0.9002 0.7981 0.452569 0.08383 0.17475 0.065273
113 0.8156 0.522143 0.9025 0.7984 0.452796 0.08433 0.16804 0.067664
115 0.8138 0.510357 0.9075 0.7989 0.447754 0.08722 0.16841 0.063067
117 0.8131 0.528571 0.9013 0.7995 0.455319 0.08265 0.17139 0.063467
119 0.8188 0.532857 0.9095 0.8064 0.471617 0.08475 0.16431 0.060784
121 0.8225 0.533571 0.9044 0.8026 0.464322 0.08613 0.17599 0.068058
123 0.8245 0.538571 0.9022 0.8026 0.466771 0.08876 0.17423 0.071140
125 0.8815 0.680000 0.9343 0.8647 0.639461 0.08004 0.16685 0.055800
127 0.8912 0.701786 0.9282 0.8661 0.649271 0.07635 0.16151 0.066473
129 0.8900 0.701429 0.9302 0.8676 0.652656 0.07869 0.16370 0.066277
131 0.8914 0.691429 0.9302 0.8646 0.643526 0.07691 0.16485 0.063667
132 0.8893 0.674286 0.9322 0.8616 0.633022 0.07449 0.15082 0.058051
AccuracySD KappaSD Selected
0.02203 0.02882
0.01749 0.03100
0.01499 0.01428
0.01339 0.01307
0.01455 0.01686
0.02500 0.04221
0.01587 0.01909
0.02934 0.05591
0.02970 0.06116
0.01998 0.03986
0.02564 0.04107
0.02283 0.03372
0.03352 0.08812
0.02360 0.08907
0.02977 0.09988
0.03099 0.09088
0.04439 0.11421
0.04207 0.13477
0.04654 0.13361
0.05764 0.16761
0.05818 0.16646
0.06259 0.18393
0.06600 0.18597
0.06204 0.17159
0.05692 0.17232
0.06115 0.17055
0.06655 0.18785
0.06179 0.17173
0.06063 0.16750
0.06290 0.19192
0.05784 0.17611
0.06131 0.18280
0.06970 0.20398
0.06393 0.16434
0.06248 0.17773
0.06188 0.16790
0.06071 0.18004
0.05861 0.15959
0.06317 0.17973
0.06094 0.17690
0.06597 0.18691
0.06481 0.19130
0.05888 0.16835
0.05787 0.16558
0.07162 0.20140
0.06980 0.19570
0.06206 0.18188
0.05792 0.16732
0.05728 0.16662
0.06128 0.17683
0.06266 0.18665
0.06926 0.19777
0.06456 0.17840
0.06237 0.18096
0.06472 0.17957
0.06687 0.18908
0.06858 0.18914
0.06080 0.17232
0.06273 0.17956
0.06103 0.17401
0.06788 0.19012
0.06961 0.18870
0.06327 0.17249
0.05461 0.14531
0.06171 0.16233
0.05834 0.15316 *
0.05327 0.14264
The top 5 variables (out of 131):
Ab_42, tau, p_tau, MMP10, MIF
>
> ctrl$functions <- lrFuncs
> ctrl$functions$summary <- fiveStats
>
> set.seed(721)
> lrRFE <- rfe(training[, predVars],
+ training$Class,
+ sizes = varSeq,
+ metric = "ROC",
+ rfeControl = ctrl)
> lrRFE
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD
1 0.7600 0.3325 0.9313 0.7675 0.2868 0.13224 0.2611 0.06965
3 0.7787 0.4489 0.9054 0.7800 0.3692 0.13636 0.2786 0.07627
5 0.8002 0.5332 0.9148 0.8099 0.4651 0.14821 0.2762 0.07436
7 0.8300 0.6118 0.9067 0.8258 0.5318 0.12810 0.2435 0.07093
9 0.8497 0.6425 0.9035 0.8317 0.5561 0.10148 0.2136 0.06977
11 0.8550 0.6589 0.9062 0.8381 0.5792 0.09568 0.1699 0.06843
13 0.8571 0.6536 0.9053 0.8361 0.5732 0.09524 0.1703 0.06620
15 0.8543 0.6679 0.9000 0.8361 0.5778 0.10808 0.1706 0.07023
17 0.8529 0.6729 0.8837 0.8257 0.5564 0.10130 0.1704 0.06520
19 0.8562 0.6696 0.8866 0.8273 0.5562 0.10051 0.1889 0.06399
21 0.8515 0.6614 0.8826 0.8220 0.5445 0.10679 0.1867 0.06815
23 0.8473 0.6811 0.8703 0.8183 0.5447 0.10444 0.1812 0.07585
25 0.8523 0.6775 0.8682 0.8160 0.5386 0.10208 0.1808 0.07375
27 0.8448 0.6775 0.8651 0.8138 0.5336 0.10287 0.1871 0.07084
29 0.8369 0.6914 0.8621 0.8153 0.5406 0.11255 0.1918 0.07864
31 0.8172 0.6743 0.8518 0.8032 0.5125 0.14429 0.1901 0.07863
33 0.8239 0.6804 0.8417 0.7974 0.5029 0.10737 0.1839 0.07630
35 0.7846 0.6850 0.8249 0.7866 0.4862 0.14152 0.1684 0.07715
37 0.7456 0.6629 0.8212 0.7778 0.4625 0.15954 0.1755 0.07874
39 0.7291 0.6646 0.8136 0.7732 0.4540 0.15947 0.1853 0.08543
41 0.7472 0.6707 0.8197 0.7792 0.4659 0.13699 0.1816 0.07805
43 0.7364 0.6468 0.8153 0.7691 0.4400 0.14810 0.1897 0.08137
45 0.7636 0.6746 0.8003 0.7657 0.4450 0.10668 0.1683 0.09067
47 0.7619 0.6904 0.8011 0.7706 0.4602 0.12478 0.1685 0.09794
49 0.7720 0.6782 0.8156 0.7776 0.4673 0.11553 0.1853 0.09389
51 0.7819 0.7029 0.8099 0.7800 0.4813 0.11128 0.1693 0.09576
53 0.7836 0.6939 0.8213 0.7860 0.4916 0.11668 0.1542 0.09829
55 0.7984 0.7000 0.8159 0.7838 0.4902 0.08453 0.1478 0.10211
57 0.7741 0.6768 0.8151 0.7765 0.4683 0.12412 0.1706 0.10082
59 0.7795 0.6657 0.8119 0.7710 0.4551 0.12299 0.1737 0.10371
61 0.7921 0.6743 0.8189 0.7786 0.4707 0.10119 0.1800 0.09823
63 0.7885 0.6757 0.8024 0.7674 0.4501 0.10087 0.1745 0.09314
65 0.7939 0.6786 0.8106 0.7740 0.4637 0.10055 0.1827 0.10282
67 0.7955 0.6511 0.8046 0.7621 0.4327 0.09315 0.1860 0.09935
69 0.7980 0.6871 0.8036 0.7712 0.4634 0.10358 0.1645 0.10550
71 0.7881 0.6864 0.7944 0.7645 0.4525 0.10688 0.1845 0.11695
73 0.7837 0.6632 0.7944 0.7577 0.4294 0.10418 0.1899 0.10392
75 0.7841 0.6668 0.7923 0.7570 0.4286 0.10367 0.1970 0.10784
77 0.7805 0.6682 0.7961 0.7605 0.4370 0.10579 0.1826 0.11082
79 0.7812 0.6696 0.7985 0.7628 0.4430 0.10462 0.1768 0.11188
81 0.7837 0.6621 0.7901 0.7545 0.4259 0.09616 0.1793 0.11344
83 0.7837 0.6486 0.7881 0.7493 0.4109 0.09257 0.1870 0.11317
85 0.7843 0.6600 0.7858 0.7508 0.4192 0.09711 0.1624 0.11135
87 0.7869 0.6350 0.7870 0.7447 0.3995 0.08773 0.1785 0.11444
89 0.7912 0.6679 0.7838 0.7514 0.4236 0.08960 0.1570 0.11414
91 0.7962 0.6764 0.7851 0.7545 0.4329 0.08569 0.1592 0.11996
93 0.7918 0.6875 0.7828 0.7559 0.4353 0.08811 0.1699 0.10608
95 0.7920 0.6689 0.7768 0.7463 0.4130 0.08550 0.1707 0.10948
97 0.7834 0.6632 0.7791 0.7463 0.4117 0.09253 0.1660 0.11186
99 0.7832 0.6657 0.7747 0.7438 0.4072 0.08899 0.1731 0.10941
101 0.7851 0.6679 0.7778 0.7470 0.4150 0.09378 0.1672 0.11201
103 0.7876 0.6682 0.7758 0.7455 0.4119 0.09109 0.1716 0.11139
105 0.7872 0.6725 0.7842 0.7529 0.4282 0.09882 0.1589 0.11508
107 0.7869 0.6775 0.7852 0.7552 0.4330 0.10293 0.1613 0.11178
109 0.7845 0.6664 0.7841 0.7515 0.4235 0.11155 0.1595 0.11597
111 0.7831 0.6646 0.7746 0.7440 0.4099 0.10095 0.1708 0.11756
113 0.7830 0.6646 0.7788 0.7470 0.4131 0.09778 0.1708 0.10983
115 0.7841 0.6643 0.7778 0.7462 0.4123 0.09882 0.1659 0.11286
117 0.7827 0.6696 0.7819 0.7507 0.4220 0.10605 0.1594 0.10893
119 0.7831 0.6675 0.7831 0.7508 0.4195 0.10265 0.1760 0.10406
121 0.7848 0.6721 0.7779 0.7485 0.4188 0.10165 0.1679 0.11203
123 0.7839 0.6675 0.7779 0.7471 0.4147 0.10471 0.1686 0.11040
125 0.7822 0.6696 0.7779 0.7478 0.4175 0.10507 0.1696 0.11471
127 0.7818 0.6696 0.7779 0.7479 0.4173 0.10490 0.1632 0.10984
129 0.7825 0.6693 0.7788 0.7485 0.4179 0.10320 0.1659 0.10946
131 0.7846 0.6696 0.7779 0.7478 0.4170 0.10057 0.1652 0.10989
132 0.7846 0.6696 0.7779 0.7478 0.4170 0.10057 0.1652 0.10989
AccuracySD KappaSD Selected
0.07066 0.2617
0.08671 0.2829
0.08625 0.2690
0.09111 0.2633
0.07383 0.2043
0.06849 0.1733
0.06522 0.1667 *
0.06775 0.1702
0.06406 0.1631
0.06541 0.1750
0.06900 0.1806
0.07172 0.1767
0.07361 0.1815
0.07640 0.1907
0.07496 0.1864
0.07783 0.1932
0.06920 0.1712
0.07455 0.1774
0.07537 0.1831
0.07624 0.1800
0.06610 0.1632
0.07209 0.1740
0.05611 0.1242
0.06826 0.1483
0.06754 0.1534
0.06433 0.1383
0.06896 0.1485
0.06929 0.1391
0.07736 0.1652
0.07985 0.1716
0.08330 0.1842
0.08009 0.1732
0.08553 0.1853
0.08245 0.1841
0.07897 0.1662
0.09083 0.1910
0.07839 0.1723
0.07590 0.1653
0.07576 0.1576
0.08382 0.1767
0.08406 0.1725
0.08361 0.1763
0.07799 0.1526
0.08361 0.1739
0.07821 0.1536
0.08338 0.1619
0.07202 0.1492
0.07270 0.1453
0.07160 0.1409
0.07181 0.1454
0.08167 0.1649
0.07988 0.1632
0.08179 0.1628
0.08254 0.1655
0.08433 0.1691
0.09071 0.1863
0.08187 0.1732
0.08055 0.1652
0.08032 0.1643
0.08034 0.1761
0.08198 0.1721
0.08217 0.1736
0.08674 0.1802
0.08394 0.1752
0.08337 0.1753
0.08250 0.1718
0.08250 0.1718
The top 5 variables (out of 13):
tau, Cortisol, VEGF, Clusterin_Apo_J, Fetuin_A
>
> ctrl$functions <- caretFuncs
> ctrl$functions$summary <- fiveStats
>
> set.seed(721)
> knnRFE <- rfe(training[, predVars],
+ training$Class,
+ sizes = varSeq,
+ metric = "ROC",
+ method = "knn",
+ tuneLength = 20,
+ preProc = c("center", "scale"),
+ trControl = cvCtrl,
+ rfeControl = ctrl)
> knnRFE
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance over subset size:
Variables ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD
1 0.6064 0.000000 0.9979 0.7252 -0.002829 0.11620 0.00000 0.01016
3 0.6105 0.021071 0.9813 0.7191 0.003301 0.10610 0.05503 0.03447
5 0.6030 0.010714 0.9783 0.7139 -0.014339 0.12140 0.04462 0.04226
7 0.6138 0.005000 0.9877 0.7193 -0.009955 0.11161 0.02474 0.02842
9 0.6113 0.035000 0.9701 0.7147 0.006480 0.10063 0.07829 0.04468
11 0.5891 0.000000 0.9917 0.7208 -0.010703 0.08800 0.00000 0.02841
13 0.5949 0.000000 1.0000 0.7267 0.000000 0.10663 0.00000 0.00000
15 0.5941 0.005000 0.9836 0.7162 -0.015025 0.10248 0.02474 0.03487
17 0.5921 0.000000 0.9907 0.7200 -0.012532 0.09876 0.00000 0.02263
19 0.6052 0.000000 0.9979 0.7252 -0.002718 0.10539 0.00000 0.01489
21 0.6080 0.002857 0.9907 0.7207 -0.008502 0.10139 0.02020 0.02488
23 0.6283 0.005357 0.9857 0.7177 -0.011870 0.10618 0.02657 0.03255
25 0.6226 0.005357 0.9918 0.7222 -0.003806 0.09542 0.02657 0.02794
27 0.6105 0.002857 0.9855 0.7169 -0.015451 0.10561 0.02020 0.03329
29 0.6202 0.005357 0.9908 0.7216 -0.004856 0.12201 0.02657 0.02659
31 0.5902 0.016786 0.9885 0.7229 0.007459 0.11531 0.04598 0.02856
33 0.6038 0.026786 0.9848 0.7230 0.015065 0.12858 0.06146 0.03405
35 0.6339 0.027143 0.9795 0.7191 0.009051 0.13339 0.06209 0.03964
37 0.6154 0.048929 0.9702 0.7184 0.022169 0.13454 0.10529 0.04676
39 0.6710 0.104643 0.9598 0.7260 0.082074 0.13048 0.10641 0.04772
41 0.6559 0.117857 0.9694 0.7365 0.112048 0.12886 0.11595 0.04790
43 0.6601 0.137857 0.9538 0.7306 0.114381 0.11575 0.13103 0.04776
45 0.6602 0.112500 0.9487 0.7200 0.076461 0.13013 0.12582 0.05479
47 0.6943 0.146429 0.9467 0.7275 0.113919 0.12200 0.12367 0.05346
49 0.6745 0.127500 0.9508 0.7255 0.091337 0.13207 0.16291 0.06245
51 0.7090 0.197500 0.9425 0.7382 0.164804 0.11652 0.16601 0.05713
53 0.6945 0.164286 0.9538 0.7373 0.142690 0.12548 0.15692 0.06317
55 0.6978 0.166786 0.9536 0.7381 0.145468 0.12016 0.15868 0.05358
57 0.7224 0.225000 0.9301 0.7372 0.182435 0.11323 0.16346 0.07062
59 0.7086 0.198214 0.9373 0.7353 0.161550 0.12940 0.16359 0.06431
61 0.7131 0.173571 0.9526 0.7397 0.152089 0.14681 0.15374 0.05339
63 0.6994 0.201786 0.9508 0.7459 0.185907 0.12576 0.15011 0.05695
65 0.7067 0.172500 0.9517 0.7388 0.154809 0.12459 0.12248 0.05706
67 0.7015 0.161429 0.9373 0.7251 0.115826 0.11470 0.15530 0.06475
69 0.7096 0.178571 0.9415 0.7331 0.143170 0.11379 0.17086 0.07142
71 0.7136 0.216786 0.9288 0.7342 0.173306 0.10437 0.15212 0.06756
73 0.6874 0.234286 0.9269 0.7377 0.193418 0.16087 0.14728 0.07524
75 0.7146 0.177500 0.9496 0.7389 0.159190 0.12630 0.11709 0.05001
77 0.7189 0.200357 0.9435 0.7403 0.171891 0.14274 0.16054 0.05611
79 0.7220 0.176786 0.9466 0.7359 0.148615 0.13065 0.15037 0.05573
81 0.7367 0.227143 0.9597 0.7591 0.227189 0.11942 0.15589 0.04592
83 0.7392 0.260000 0.9473 0.7597 0.251253 0.12542 0.13251 0.05247
85 0.7319 0.218214 0.9570 0.7548 0.213702 0.13456 0.15718 0.05740
87 0.7428 0.259643 0.9516 0.7623 0.252703 0.15221 0.16001 0.05009
89 0.7439 0.274643 0.9352 0.7545 0.246367 0.11719 0.15914 0.05713
91 0.7595 0.283214 0.9369 0.7583 0.257821 0.09331 0.16323 0.05074
93 0.7409 0.256071 0.9350 0.7494 0.228272 0.11549 0.13400 0.05348
95 0.7524 0.250714 0.9342 0.7471 0.217301 0.09978 0.15667 0.05147
97 0.7306 0.238214 0.9353 0.7449 0.203307 0.11405 0.15938 0.04691
99 0.7308 0.280000 0.9268 0.7500 0.241501 0.12963 0.15012 0.05106
101 0.7265 0.255357 0.9312 0.7465 0.215970 0.10291 0.17091 0.05591
103 0.7197 0.280714 0.9372 0.7577 0.253722 0.14222 0.17396 0.05224
105 0.7279 0.234286 0.9436 0.7499 0.211755 0.14193 0.15926 0.05479
107 0.7456 0.247857 0.9465 0.7554 0.233926 0.11890 0.15194 0.05653
109 0.7507 0.255000 0.9393 0.7522 0.229079 0.09635 0.16771 0.05939
111 0.7461 0.255000 0.9590 0.7666 0.259253 0.12612 0.14699 0.04257
113 0.7518 0.267143 0.9447 0.7592 0.252789 0.09650 0.14879 0.06038
115 0.7713 0.258214 0.9589 0.7672 0.263819 0.08508 0.14801 0.05072
117 0.7702 0.243571 0.9498 0.7570 0.234860 0.09519 0.14184 0.05952
119 0.7605 0.220357 0.9495 0.7499 0.205067 0.08238 0.15292 0.05275
121 0.7743 0.241786 0.9517 0.7574 0.237198 0.09812 0.12561 0.05411
123 0.7584 0.280714 0.9467 0.7644 0.268479 0.08869 0.16852 0.05705
125 0.7942 0.339286 0.9506 0.7832 0.338318 0.07205 0.16489 0.04618
127 0.7731 0.387857 0.9569 0.8013 0.400311 0.13220 0.16082 0.04310
129 0.7984 0.422143 0.9548 0.8088 0.432632 0.07905 0.17562 0.05481
131 0.7769 0.424286 0.9436 0.8014 0.418138 0.13265 0.17293 0.05624
132 0.7769 0.402143 0.9475 0.7983 0.403517 0.11562 0.15858 0.05861
AccuracySD KappaSD Selected
0.01441 0.01401
0.03140 0.08531
0.03484 0.07018
0.02289 0.02972
0.03912 0.11306
0.02571 0.03597
0.01339 0.00000
0.02541 0.04511
0.02035 0.03032
0.01688 0.01922
0.02391 0.04407
0.02572 0.04782
0.02093 0.02935
0.02557 0.04149
0.02467 0.04702
0.02648 0.06738
0.02391 0.07722
0.03150 0.09220
0.03618 0.12457
0.03880 0.13217
0.04481 0.15396
0.05047 0.17032
0.05463 0.17769
0.03810 0.13242
0.04479 0.15554
0.04737 0.17116
0.04601 0.15784
0.04827 0.16931
0.05562 0.17556
0.05840 0.19088
0.04476 0.16073
0.05346 0.17446
0.04638 0.14586
0.04167 0.14837
0.06196 0.19803
0.05361 0.16956
0.06433 0.18473
0.04873 0.15096
0.05361 0.18296
0.04770 0.16618
0.05196 0.18409
0.05340 0.16931
0.05538 0.18451
0.05248 0.17640
0.05345 0.17649
0.05379 0.18031
0.04933 0.14716
0.05112 0.17577
0.04758 0.17363
0.05599 0.17706
0.04843 0.16940
0.05620 0.19190
0.05101 0.17896
0.05500 0.17980
0.05292 0.18030
0.04626 0.16770
0.04583 0.14661
0.05063 0.17167
0.05027 0.16002
0.04999 0.16962
0.04716 0.14377
0.05584 0.18904
0.04910 0.16983
0.04807 0.16637
0.06520 0.19926 *
0.06499 0.19222
0.06089 0.17843
The top 5 variables (out of 129):
Ab_42, tau, p_tau, MMP10, MIF
>
> ## Each of these models can be evaluated using the plot() function to see
> ## the performance profile across subset sizes.
>
> ## Test set ROC results:
> rfROCfull <- roc(testing$Class,
+ predict(rfFull, testing[,predVars], type = "prob")[,1])
> rfROCfull
Call:
roc.default(response = testing$Class, predictor = predict(rfFull, testing[, predVars], type = "prob")[, 1])
Data: predict(rfFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.9034
> rfROCrfe <- roc(testing$Class,
+ predict(rfRFE, testing[,predVars])$Impaired)
> rfROCrfe
Call:
roc.default(response = testing$Class, predictor = predict(rfRFE, testing[, predVars])$Impaired)
Data: predict(rfRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8941
>
> ldaROCfull <- roc(testing$Class,
+ predict(ldaFull, testing[,predVars], type = "prob")[,1])
> ldaROCfull
Call:
roc.default(response = testing$Class, predictor = predict(ldaFull, testing[, predVars], type = "prob")[, 1])
Data: predict(ldaFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8981
> ldaROCrfe <- roc(testing$Class,
+ predict(ldaRFE, testing[,predVars])$Impaired)
> ldaROCrfe
Call:
roc.default(response = testing$Class, predictor = predict(ldaRFE, testing[, predVars])$Impaired)
Data: predict(ldaRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.9259
>
> nbROCfull <- roc(testing$Class,
+ predict(nbFull, testing[,predVars], type = "prob")[,1])
There were 50 or more warnings (use warnings() to see the first 50)
> nbROCfull
Call:
roc.default(response = testing$Class, predictor = predict(nbFull, testing[, predVars], type = "prob")[, 1])
Data: predict(nbFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8287
> nbROCrfe <- roc(testing$Class,
+ predict(nbRFE, testing[,predVars])$Impaired)
Warning message:
In FUN(1:66[[66L]], ...) :
Numerical 0 probability for all classes with observation 22
> nbROCrfe
Call:
roc.default(response = testing$Class, predictor = predict(nbRFE, testing[, predVars])$Impaired)
Data: predict(nbRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8565
>
> svmROCfull <- roc(testing$Class,
+ predict(svmFull, testing[,predVars], type = "prob")[,1])
> svmROCfull
Call:
roc.default(response = testing$Class, predictor = predict(svmFull, testing[, predVars], type = "prob")[, 1])
Data: predict(svmFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8727
> svmROCrfe <- roc(testing$Class,
+ predict(svmRFE, testing[,predVars])$Impaired)
> svmROCrfe
Call:
roc.default(response = testing$Class, predictor = predict(svmRFE, testing[, predVars])$Impaired)
Data: predict(svmRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8681
>
> lrROCfull <- roc(testing$Class,
+ predict(lrFull, testing[,predVars], type = "prob")[,1])
> lrROCfull
Call:
roc.default(response = testing$Class, predictor = predict(lrFull, testing[, predVars], type = "prob")[, 1])
Data: predict(lrFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8513
> lrROCrfe <- roc(testing$Class,
+ predict(lrRFE, testing[,predVars])$Impaired)
> lrROCrfe
Call:
roc.default(response = testing$Class, predictor = predict(lrRFE, testing[, predVars])$Impaired)
Data: predict(lrRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.89
>
> knnROCfull <- roc(testing$Class,
+ predict(knnFull, testing[,predVars], type = "prob")[,1])
> knnROCfull
Call:
roc.default(response = testing$Class, predictor = predict(knnFull, testing[, predVars], type = "prob")[, 1])
Data: predict(knnFull, testing[, predVars], type = "prob")[, 1] in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8762
> knnROCrfe <- roc(testing$Class,
+ predict(knnRFE, testing[,predVars])$Impaired)
> knnROCrfe
Call:
roc.default(response = testing$Class, predictor = predict(knnRFE, testing[, predVars])$Impaired)
Data: predict(knnRFE, testing[, predVars])$Impaired in 18 controls (testing$Class Impaired) > 48 cases (testing$Class Control).
Area under the curve: 0.8391
>
>
> ## For filter methods, the sbf() function (named for Selection By Filter) is
> ## used. It has similar arguments to rfe() to control the model fitting and
> ## filtering methods.
>
> ## P-values are created for filtering.
>
> ## A set of four LDA models are fit based on two factors: p-value adjustment
> ## using a Bonferroni adjustment and whether the predictors should be
> ## pre-screened for high correlations.
>
> sbfResamp <- function(x, fun = mean)
+ {
+ x <- unlist(lapply(x$variables, length))
+ fun(x)
+ }
> sbfROC <- function(mod) auc(roc(testing$Class, predict(mod, testing)$Impaired))
>
> ## This function calculates p-values using a t-test (when the predictor
> ## has more than two distinct values) or Fisher's Exact Test otherwise.
>
> pScore <- function(x, y)
+ {
+ numX <- length(unique(x))
+ if(numX > 2)
+ {
+ out <- t.test(x ~ y)$p.value
+ } else {
+ out <- fisher.test(factor(x), y)$p.value
+ }
+ out
+ }
> ldaWithPvalues <- ldaSBF
> ldaWithPvalues$score <- pScore
> ldaWithPvalues$summary <- fiveStats
>
> ## Predictors are retained if their p-value is less than the completely
> ## subjective cut-off of 0.05.
>
> ldaWithPvalues$filter <- function (score, x, y)
+ {
+ keepers <- score <= 0.05
+ keepers
+ }
>
> sbfCtrl <- sbfControl(method = "repeatedcv",
+ repeats = 5,
+ verbose = TRUE,
+ functions = ldaWithPvalues,
+ index = index)
>
> rawCorr <- sbf(training[, predVars],
+ training$Class,
+ tol = 1.0e-12,
+ sbfControl = sbfCtrl)
> rawCorr
Selection By Filter
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance:
ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
0.9168 0.7439 0.9136 0.867 0.6588 0.06458 0.1778 0.05973 0.0567 0.1512
Using the training set, 47 variables were selected:
Alpha_1_Antitrypsin, Apolipoprotein_D, B_Lymphocyte_Chemoattractant_BL, Complement_3, Cortisol...
During resampling, the top 5 selected variables (out of a possible 66):
Ab_42 (100%), age (100%), Cortisol (100%), Creatine_Kinase_MB (100%), Cystatin_C (100%)
On average, 46.1 variables were selected (min = 38, max = 57)
>
> ldaWithPvalues$filter <- function (score, x, y)
+ {
+ score <- p.adjust(score, "bonferroni")
+ keepers <- score <= 0.05
+ keepers
+ }
> sbfCtrl <- sbfControl(method = "repeatedcv",
+ repeats = 5,
+ verbose = TRUE,
+ functions = ldaWithPvalues,
+ index = index)
>
> adjCorr <- sbf(training[, predVars],
+ training$Class,
+ tol = 1.0e-12,
+ sbfControl = sbfCtrl)
> adjCorr
Selection By Filter
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance:
ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
0.8563 0.6443 0.9083 0.8361 0.5663 0.07646 0.201 0.06721 0.06283 0.1778
Using the training set, 17 variables were selected:
Creatine_Kinase_MB, Eotaxin_3, FAS, GRO_alpha, IGF_BP_2...
During resampling, the top 5 selected variables (out of a possible 23):
Ab_42 (100%), GRO_alpha (100%), MIF (100%), p_tau (100%), tau (100%)
On average, 13.5 variables were selected (min = 9, max = 19)
>
> ldaWithPvalues$filter <- function (score, x, y)
+ {
+ keepers <- score <= 0.05
+ corrMat <- cor(x[,keepers])
+ tooHigh <- findCorrelation(corrMat, .75)
+ if(length(tooHigh) > 0) keepers[tooHigh] <- FALSE
+ keepers
+ }
> sbfCtrl <- sbfControl(method = "repeatedcv",
+ repeats = 5,
+ verbose = TRUE,
+ functions = ldaWithPvalues,
+ index = index)
>
> rawNoCorr <- sbf(training[, predVars],
+ training$Class,
+ tol = 1.0e-12,
+ sbfControl = sbfCtrl)
> rawNoCorr
Selection By Filter
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance:
ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
0.918 0.7357 0.9125 0.8638 0.6508 0.06282 0.1787 0.06498 0.05687 0.1474
Using the training set, 45 variables were selected:
Alpha_1_Antitrypsin, Apolipoprotein_D, B_Lymphocyte_Chemoattractant_BL, Complement_3, Cortisol...
During resampling, the top 5 selected variables (out of a possible 66):
Ab_42 (100%), age (100%), E4 (100%), IGF_BP_2 (100%), IL_17E (100%)
On average, 44.3 variables were selected (min = 37, max = 54)
>
> ldaWithPvalues$filter <- function (score, x, y)
+ {
+ score <- p.adjust(score, "bonferroni")
+ keepers <- score <= 0.05
+ corrMat <- cor(x[,keepers])
+ tooHigh <- findCorrelation(corrMat, .75)
+ if(length(tooHigh) > 0) keepers[tooHigh] <- FALSE
+ keepers
+ }
> sbfCtrl <- sbfControl(method = "repeatedcv",
+ repeats = 5,
+ verbose = TRUE,
+ functions = ldaWithPvalues,
+ index = index)
>
> adjNoCorr <- sbf(training[, predVars],
+ training$Class,
+ tol = 1.0e-12,
+ sbfControl = sbfCtrl)
> adjNoCorr
Selection By Filter
Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
Resampling performance:
ROC Sens Spec Accuracy Kappa ROCSD SensSD SpecSD AccuracySD KappaSD
0.8563 0.6443 0.9083 0.8361 0.5663 0.07646 0.201 0.06721 0.06283 0.1778
Using the training set, 17 variables were selected:
Creatine_Kinase_MB, Eotaxin_3, FAS, GRO_alpha, IGF_BP_2...
During resampling, the top 5 selected variables (out of a possible 23):
Ab_42 (100%), GRO_alpha (100%), MIF (100%), p_tau (100%), tau (100%)
On average, 13.5 variables were selected (min = 9, max = 19)
>
> ## Filter methods test set ROC results:
>
> sbfROC(rawCorr)
Area under the curve: 0.9178
> sbfROC(rawNoCorr)
Area under the curve: 0.9155
> sbfROC(adjCorr)
Area under the curve: 0.9259
> sbfROC(adjNoCorr)
Area under the curve: 0.9259
>
> ## Get the resampling results for all the models
>
> rfeResamples <- resamples(list(RF = rfRFE,
+ "Logistic Reg." = lrRFE,
+ "SVM" = svmRFE,
+ "$K$--NN" = knnRFE,
+ "N. Bayes" = nbRFE,
+ "LDA" = ldaRFE))
> summary(rfeResamples)
Call:
summary.resamples(object = rfeResamples)
Models: RF, Logistic Reg., SVM, $K$--NN, N. Bayes, LDA
Number of resamples: 50
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.3714 0.8694 0.9229 0.8996 0.9611 1.0000 0
Logistic Reg. 0.6429 0.7984 0.8571 0.8571 0.9370 1.0000 0
SVM 0.7000 0.8421 0.8947 0.8914 0.9611 1.0000 0
$K$--NN 0.6283 0.7332 0.8004 0.7984 0.8709 0.9211 0
N. Bayes 0.6357 0.7759 0.8346 0.8318 0.8797 0.9925 0
LDA 0.7429 0.8716 0.9312 0.9163 0.9783 1.0000 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.2857 0.5714 0.7143 0.6696 0.7500 1.0000 0
Logistic Reg. 0.3750 0.5714 0.6250 0.6536 0.7411 1.0000 0
SVM 0.3750 0.5714 0.7143 0.6914 0.7500 1.0000 0
$K$--NN 0.1250 0.2857 0.4286 0.4221 0.5714 0.7143 0
N. Bayes 0.2857 0.5714 0.7143 0.6807 0.7500 1.0000 0
LDA 0.2500 0.6250 0.7143 0.7407 0.8571 1.0000 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.8500 0.9474 1.0000 0.9650 1.0000 1 0
Logistic Reg. 0.7000 0.8500 0.9000 0.9053 0.9474 1 0
SVM 0.7368 0.8947 0.9474 0.9302 1.0000 1 0
$K$--NN 0.7000 0.9474 0.9500 0.9548 1.0000 1 0
N. Bayes 0.6500 0.8000 0.8421 0.8387 0.8947 1 0
LDA 0.7895 0.8947 0.9000 0.9217 0.9500 1 0
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.7407 0.8519 0.8889 0.8839 0.9252 0.9630 0
Logistic Reg. 0.6667 0.7912 0.8462 0.8361 0.8777 0.9630 0
SVM 0.7692 0.8148 0.8519 0.8646 0.8929 1.0000 0
$K$--NN 0.6071 0.7778 0.8113 0.8088 0.8462 0.9231 0
N. Bayes 0.6538 0.7500 0.7778 0.7950 0.8462 0.9630 0
LDA 0.7407 0.8276 0.8846 0.8721 0.9231 1.0000 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.21580 0.5738 0.7027 0.6791 0.7874 0.9078 0
Logistic Reg. 0.23820 0.4717 0.5702 0.5732 0.6676 0.9078 0
SVM 0.35540 0.5408 0.6157 0.6435 0.7450 1.0000 0
$K$--NN 0.05263 0.3307 0.4348 0.4326 0.5737 0.7851 0
N. Bayes 0.21800 0.3999 0.4957 0.5000 0.6370 0.9143 0
LDA 0.28950 0.5519 0.6808 0.6678 0.7851 1.0000 0
>
> fullResamples <- resamples(list(RF = rfFull,
+ "Logistic Reg." = lrFull,
+ "SVM" = svmFull,
+ "$K$--NN" = knnFull,
+ "N. Bayes" = nbFull,
+ "LDA" = ldaFull))
> summary(fullResamples)
Call:
summary.resamples(object = fullResamples)
Models: RF, Logistic Reg., SVM, $K$--NN, N. Bayes, LDA
Number of resamples: 50
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.7179 0.8528 0.8980 0.8904 0.9423 1.0000 0
Logistic Reg. 0.5214 0.7240 0.7951 0.7846 0.8612 0.9464 0
SVM 0.7143 0.8441 0.8938 0.8920 0.9611 1.0000 0
$K$--NN 0.7030 0.8047 0.8536 0.8494 0.9011 0.9737 0
N. Bayes 0.5263 0.7237 0.8036 0.7980 0.8690 1.0000 0
LDA 0.5357 0.7864 0.8571 0.8439 0.9059 0.9850 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.0000 0.3080 0.4643 0.4496 0.5714 0.7143 0
Logistic Reg. 0.1429 0.5714 0.7143 0.6696 0.7143 1.0000 0
SVM 0.2857 0.5714 0.7143 0.6964 0.7500 1.0000 0
$K$--NN 0.0000 0.1295 0.1429 0.1957 0.2857 0.4286 0
N. Bayes 0.2500 0.4464 0.5714 0.5936 0.7143 0.8750 0
LDA 0.2500 0.5714 0.7143 0.6857 0.8304 1.0000 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.9000 0.9625 1.0000 0.9847 1.0000 1 0
Logistic Reg. 0.4737 0.7368 0.7895 0.7779 0.8500 1 0
SVM 0.7368 0.9000 0.9474 0.9332 1.0000 1 0
$K$--NN 0.9474 1.0000 1.0000 0.9907 1.0000 1 0
N. Bayes 0.6316 0.7500 0.8000 0.8139 0.8947 1 0
LDA 0.6842 0.7500 0.8421 0.8294 0.8947 1 0
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.7308 0.8148 0.8462 0.8383 0.8846 0.9231 0
Logistic Reg. 0.5185 0.6952 0.7692 0.7478 0.8077 0.8889 0
SVM 0.7407 0.8462 0.8709 0.8683 0.9155 0.9630 0
$K$--NN 0.6923 0.7500 0.7692 0.7731 0.8022 0.8519 0
N. Bayes 0.5714 0.7037 0.7692 0.7530 0.8077 0.8889 0
LDA 0.6667 0.7500 0.7778 0.7900 0.8462 0.9259 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
RF 0.00000 0.3878 0.5229 0.5057 0.6609 0.7851 0
Logistic Reg. -0.19800 0.3292 0.4336 0.4170 0.5098 0.7417 0
SVM 0.28950 0.5702 0.6554 0.6527 0.7788 0.9065 0
$K$--NN -0.07216 0.1695 0.1980 0.2417 0.3573 0.5263 0
N. Bayes 0.02326 0.2863 0.4075 0.3966 0.5092 0.7235 0
LDA 0.10330 0.3741 0.4757 0.4910 0.6089 0.8224 0
>
> filteredResamples <- resamples(list("No Adjustment, Corr Vars" = rawCorr,
+ "No Adjustment, No Corr Vars" = rawNoCorr,
+ "Bonferroni, Corr Vars" = adjCorr,
+ "Bonferroni, No Corr Vars" = adjNoCorr))
> summary(filteredResamples)
Call:
summary.resamples(object = filteredResamples)
Models: No Adjustment, Corr Vars, No Adjustment, No Corr Vars, Bonferroni, Corr Vars, Bonferroni, No Corr Vars
Number of resamples: 50
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars 0.7714 0.8647 0.9281 0.9168 0.9768 1 0
No Adjustment, No Corr Vars 0.7786 0.8816 0.9263 0.9180 0.9759 1 0
Bonferroni, Corr Vars 0.6643 0.8239 0.8531 0.8563 0.8970 1 0
Bonferroni, No Corr Vars 0.6643 0.8239 0.8531 0.8563 0.8970 1 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars 0.2500 0.5848 0.7321 0.7439 0.8571 1 0
No Adjustment, No Corr Vars 0.3750 0.5714 0.7143 0.7357 0.8571 1 0
Bonferroni, Corr Vars 0.2857 0.5000 0.6250 0.6443 0.7500 1 0
Bonferroni, No Corr Vars 0.2857 0.5000 0.6250 0.6443 0.7500 1 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars 0.7895 0.8947 0.9 0.9136 0.95 1 0
No Adjustment, No Corr Vars 0.7500 0.8500 0.9 0.9125 0.95 1 0
Bonferroni, Corr Vars 0.7368 0.8500 0.9 0.9083 0.95 1 0
Bonferroni, No Corr Vars 0.7368 0.8500 0.9 0.9083 0.95 1 0
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars 0.7407 0.8462 0.8571 0.8670 0.8919 1 0
No Adjustment, No Corr Vars 0.7407 0.8226 0.8519 0.8638 0.8889 1 0
Bonferroni, Corr Vars 0.7037 0.7778 0.8462 0.8361 0.8846 1 0
Bonferroni, No Corr Vars 0.7037 0.7778 0.8462 0.8361 0.8846 1 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
No Adjustment, Corr Vars 0.3193 0.5705 0.6609 0.6588 0.7390 1 0
No Adjustment, No Corr Vars 0.3549 0.5702 0.6414 0.6508 0.7381 1 0
Bonferroni, Corr Vars 0.2087 0.4343 0.5766 0.5663 0.6957 1 0
Bonferroni, No Corr Vars 0.2087 0.4343 0.5766 0.5663 0.6957 1 0
>
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] klaR_0.6-7 kernlab_0.9-16
[3] MASS_7.3-26 e1071_1.6-1
[5] class_7.3-7 pROC_1.5.4
[7] plyr_1.8 randomForest_4.6-7
[9] corrplot_0.71 RColorBrewer_1.0-5
[11] doMC_1.3.0 iterators_1.0.6
[13] foreach_1.4.0 caret_6.0-22
[15] ggplot2_0.9.3.1 lattice_0.20-15
[17] AppliedPredictiveModeling_1.1-5
loaded via a namespace (and not attached):
[1] car_2.0-16 codetools_0.2-8 colorspace_1.2-1 compiler_3.0.1
[5] CORElearn_0.9.41 dichromat_2.0-0 digest_0.6.3 grid_3.0.1
[9] gtable_0.1.2 labeling_0.1 munsell_0.4 proto_0.3-10
[13] reshape2_1.2.2 scales_0.2.3 stringr_0.6.2 tools_3.0.1
>
>
>
> proc.time()
user system elapsed
257587.585 7078.267 35323.717
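The recursive feature elimination runs above can also be sketched outside of caret. The following is a minimal Python analogue on hypothetical synthetic data (not the AlzheimerDisease biomarkers): scikit-learn's `RFECV` plays the role of `rfe()`, ranking predictors by model coefficients and picking the subset size that maximizes cross-validated ROC AUC. Unlike the caret runs above, this uses a single resampling loop rather than repeated CV with an outer/inner split, so it is a sketch of the idea, not a reproduction of the APM analysis.

```python
# Minimal sketch of RFE with cross-validated subset-size selection,
# analogous to caret's rfe() above. Toy data; not the APM biomarker set.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic two-class problem: 40 predictors, only 5 informative.
X, y = make_classification(n_samples=300, n_features=40,
                           n_informative=5, random_state=721)

# Rank predictors by |coefficient|, dropping one per step; choose the
# subset size that maximizes 10-fold cross-validated ROC AUC.
selector = RFECV(LogisticRegression(max_iter=1000),
                 step=1, cv=10, scoring="roc_auc")
selector.fit(X, y)

print("selected subset size:", selector.n_features_)
print("kept predictor indices:",
      [i for i, kept in enumerate(selector.support_) if kept])
```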
%%R -w 600 -h 600
## runChapterScript(19)
## user system elapsed
## 257587.585 7078.267 35323.717
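The selection-by-filter strategy used with `sbf()` above — univariate p-values, an optional Bonferroni adjustment, and an optional correlation pre-screen — is straightforward to sketch directly. Below is a minimal Python version of the `pScore`/`filter` pair on hypothetical synthetic data, mirroring the 0.05 cut-off and the 0.75 correlation threshold from the transcript; the correlation step is only in the spirit of caret's `findCorrelation()`, not a reimplementation of it.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(721)
n, p = 120, 30
y = rng.integers(0, 2, size=n)      # two classes
X = rng.normal(size=(n, p))
X[:, 0] += 2.0 * y                  # make predictor 0 informative

# Analogue of pScore(): two-sample t-test p-value per numeric predictor.
pvals = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(p)])

# Analogue of the filter(): Bonferroni-adjust, then keep p <= 0.05.
adj = np.minimum(pvals * p, 1.0)
keep = adj <= 0.05

# Crude correlation pre-screen: among kept predictors, drop the second
# member of each pair with |r| > 0.75 (loosely like findCorrelation()).
idx = np.flatnonzero(keep)
corr = np.corrcoef(X[:, idx], rowvar=False)
for a in range(len(idx)):
    for b in range(a + 1, len(idx)):
        if keep[idx[b]] and abs(corr[a, b]) > 0.75:
            keep[idx[b]] = False

print("kept predictors:", np.flatnonzero(keep))
```

With pure-noise predictors, the Bonferroni step typically retains only the informative one — the same behavior seen above, where adjustment shrank the selected set from ~46 variables to ~13.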